train <- read.csv("train.csv", header = TRUE)

Probability

Let X = GrLivArea
x = \(Q_4\) (4th quartile) of X
Y = SalePrice
y = \(Q_2\) (2nd quartile) of Y

df <- data.frame(X = train$GrLivArea, Y = train$SalePrice)
quantile(df$X, c(0, 0.25, 0.5, 0.75, 1))
##      0%     25%     50%     75%    100% 
##  334.00 1129.50 1464.00 1776.75 5642.00
quantile(df$Y, c(0, 0.25, 0.5, 0.75, 1))
##     0%    25%    50%    75%   100% 
##  34900 129975 163000 214000 755000
pdf <- function(var) {
  approxfun(density(var))
}
cdf <- function(samp, val) {
  return(integrate(pdf(samp), min(samp), min(val, max(samp)))[1]$value)
}

hist(df$X, probability = TRUE, 
     ylim = c(0, max(density(df$X)$y)))
lines(density(df$X))

plot(ecdf(df$X))

a <- seq(min(df$X), max(df$X), (max(df$X) - min(df$X)) / 100)
plot(a, sapply(a, function(z) cdf(df$X, z)), type = "l")

hist(df$Y, probability = TRUE,
     ylim = c(0 , max(density(df$Y)$y)))
lines(density(df$Y))

plot(ecdf(df$Y))

b <- seq(min(df$Y), max(df$Y), (max(df$Y) - min(df$Y)) / 100)
plot(b, sapply(b, function(z) cdf(df$Y, z)), type = "l")

(pr_A <- (nrow(df[df$X > max(df$X) & df$Y > median(df$Y), ]) / nrow(df)) / 
  (nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 0
(pr_B <- nrow(df[df$X > max(df$X) & df$Y > median(df$Y), ]) / nrow(df))
## [1] 0
(pr_C <- (nrow(df[df$X < max(df$X) & df$Y > median(df$Y), ]) / nrow(df)) /
  (nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 1

a. \(P(X > x | Y > y) = P(X > x \cap Y > y) / P(Y > y) = P(X > 5642 \cap Y > 163000) / P(Y > 163000) = (0 / 1460) / (728 / 1460) = 0\)

This is the probability that X (GrLivArea, the above-grade living area in square feet) exceeds its fourth quartile, which is its 100th percentile or sample maximum, conditioned on the event that Y (SalePrice, the property’s sale price in dollars) exceeds its second quartile, or median. Because no observation can exceed the sample maximum, the event \(X > x\) is empty and the probability is zero.

b. \(P(X > x, Y > y) = P(X > 5642 \cap Y > 163000) = 0 / 1460 = 0\)

This is the joint probability that a property’s GrLivArea is greater than the fourth quartile of that variable and its SalePrice is greater than the second quartile of that variable.

c. \(P(X < x | Y > y) = P(X < x \cap Y > y) / P(Y > y) = P(X < 5642 \cap Y > 163000) / P(Y > 163000) = (728 / 1460) / (728 / 1460) = 1\)

This is the conditional probability that GrLivArea is less than the fourth quartile of that variable given that SalePrice is greater than the second quartile of that variable.
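
The three quantities in a, b, and c can be computed with one small helper. The following is a minimal sketch, not the code used above: `cond_prob` is a hypothetical helper name, and the six-row `toy` data frame is invented purely for illustration (it is not drawn from train.csv).

```r
# Empirical conditional probability P(A | B) from two logical vectors.
# `cond_prob` is a hypothetical helper, not part of the original analysis.
cond_prob <- function(a, b) {
  stopifnot(is.logical(a), is.logical(b), length(a) == length(b))
  if (!any(b)) return(NA_real_)  # P(B) = 0: the conditional is undefined
  mean(a & b) / mean(b)          # equivalent to sum(a & b) / sum(b)
}

# Invented toy data standing in for GrLivArea (X) and SalePrice (Y).
toy <- data.frame(X = c(900, 1200, 1500, 1800, 2400, 3000),
                  Y = c(100, 150, 140, 210, 260, 320) * 1000)

x <- max(toy$X)      # "fourth quartile" = sample maximum
y <- median(toy$Y)   # second quartile

p_a <- cond_prob(toy$X > x, toy$Y > y)  # a. P(X > x | Y > y)
p_b <- mean(toy$X > x & toy$Y > y)      # b. P(X > x, Y > y)
p_c <- cond_prob(toy$X < x, toy$Y > y)  # c. P(X < x | Y > y)
```

In this toy data the largest house also sells above the median price, so `p_c` is less than 1. With a strict inequality, \(P(X < x | Y > y)\) equals 1 only when the observation at the maximum of \(X\) falls outside the conditioning event.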

(cond_pr1 <- (nrow(df[df$X > max(df$X) & df$Y > median(df$Y), ]) / nrow(df)) /
  (nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 0
(indep_pr1 <- (nrow(df[df$X > max(df$X), ]) / nrow(df)))
## [1] 0
cond_pr1 == indep_pr1
## [1] TRUE
(cond_pr2 <- 
  (nrow(df[df$X > quantile(df$X, 0.75) & df$Y > median(df$Y), ]) / nrow(df)) / 
  (nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 0.4326923
(indep_pr2 <- (nrow(df[df$X > quantile(df$X, 0.75), ]) / nrow(df)))
## [1] 0.25
cond_pr2 == indep_pr2
## [1] FALSE
(cond_pr3 <- (nrow(df[df$X > median(df$X) & df$Y > median(df$Y), ]) / nrow(df)) /
  (nrow(df[df$Y > median(df$Y), ]) / nrow(df)))
## [1] 0.7884615
(indep_pr3 <- (nrow(df[df$X > median(df$X), ]) / nrow(df)))
## [1] 0.4993151
cond_pr3 == indep_pr3
## [1] FALSE
(t1 <- table(df$X > max(df$X), df$Y > median(df$Y)))
##        
##         FALSE TRUE
##   FALSE   732  728
chisq.test(t1)
## 
##  Chi-squared test for given probabilities
## 
## data:  t1
## X-squared = 0.010959, df = 1, p-value = 0.9166
(t2 <- table(df$X > quantile(df$X, 0.75), df$Y > median(df$Y)))
##        
##         FALSE TRUE
##   FALSE   682  413
##   TRUE     50  315
chisq.test(t2)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t2
## X-squared = 256.53, df = 1, p-value < 2.2e-16
(t3 <- table(df$X > median(df$X), df$Y > median(df$Y)))
##        
##         FALSE TRUE
##   FALSE   577  154
##   TRUE    155  574
chisq.test(t3)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t3
## X-squared = 483.29, df = 1, p-value < 2.2e-16
(t4 <- table(ceiling((ecdf(df$X)(df$X) / 0.25)), ceiling((ecdf(df$Y)(df$Y) / 0.25))))
##    
##       1   2   3   4
##   1 224 133   8   0
##   2  96 122 133  13
##   3  31  73 149 113
##   4  14  35  75 241
chisq.test(t4)
## 
##  Pearson's Chi-squared test
## 
## data:  t4
## X-squared = 908.28, df = 9, p-value < 2.2e-16

Above, I test the independence of the variables \(X\) and \(Y\) by comparing the conditional probability \(P(X > x | Y > y)\) with the unconditional probability \(P(X > x)\) for three values of \(x\): \(x \in \{Q_4(X), Q_3(X), Q_2(X)\}\). In other words, I compare the conditional probabilities that GrLivArea exceeds its fourth quartile, third quartile, and median, given that SalePrice exceeds the median sale price, with the corresponding unconditional probabilities that GrLivArea exceeds each threshold. If the two variables were independent, \(P(X > x | Y > y)\) would equal \(P(X > x)\), since the event \(Y > y\) would provide no additional information about the likelihood of \(X\) exceeding the threshold. Here, the two probabilities are equal only for \(x = Q_4(X)\): no value of \(X\) exceeds the fourth quartile, so both are zero. For the other two thresholds, the conditional probabilities (0.43 and 0.79) are well above the unconditional ones (0.25 and 0.50), which indicates that \(X\) and \(Y\) are not independent of one another.

Chi-squared tests on the two-way contingency tables of \(X > x\) and \(Y > y\) for the thresholds used in the comparisons above, and on the contingency table obtained by binning each variable at its quartile boundaries, confirm an association between the two variables. The exception is the first test, on t1 with \(x = Q_4(X)\): because there are no cases in which \(X > x\), t1 has a single row containing only the counts for \(Y > y \cap X \leq x\) and \(Y \leq y \cap X \leq x\), so chisq.test performs a goodness-of-fit test for given probabilities rather than a test of independence. All of the other tests yield p-values far below 0.05 (p < 2.2e-16), so we can reject the null hypothesis that the two variables are independent.
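
To make the test statistic concrete, the sketch below recomputes the Yates-corrected statistic for t3 by hand, using the counts printed above, and compares it with chisq.test. The variable names (`obs`, `expected`, `stat`) are illustrative choices, not names from the original analysis.

```r
# Counts copied from table t3 above: rows are X > median(X), columns Y > median(Y).
obs <- matrix(c(577, 155, 154, 574), nrow = 2,
              dimnames = list(X_gt = c("FALSE", "TRUE"),
                              Y_gt = c("FALSE", "TRUE")))

n <- sum(obs)

# Expected counts if X and Y were independent: (row total * column total) / n.
expected <- outer(rowSums(obs), colSums(obs)) / n

# Pearson statistic with Yates' continuity correction (the default for 2x2 tables).
stat <- sum((abs(obs - expected) - 0.5)^2 / expected)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

# Reference computation with the built-in test.
res <- chisq.test(obs, correct = TRUE)
```

Under independence each cell would hold roughly 365 observations, whereas the observed diagonal cells hold more than 570, which is what drives the statistic so far into the tail.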

Descriptive and Inferential Statistics

summary(df$X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642
var(df$X)
## [1] 276129.6
sd(df$X)
## [1] 525.4804
hist(df$X)

boxplot(df$X)

summary(df$Y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  130000  163000  180900  214000  755000
var(df$Y)
## [1] 6311111264
sd(df$Y)
## [1] 79442.5
hist(df$Y)

boxplot(df$Y)

plot(df)

qqnorm(lm(Y ~ X, df)$residuals)
qqline(lm(Y ~ X, df)$residuals)

library(MASS)

## transform X
bc <- boxcox(X ~ 1, data = df, lambda = seq(-2, 2, len = 1000))

## 95% CI for lambda
range(bc$x[bc$y > max(bc$y) - 1/2 * qchisq(0.95,1)])
## [1] -0.1101101  0.1221221
lambda_X <- bc$x[which.max(bc$y)]
df$X_bc <- (df$X^lambda_X - 1) / lambda_X

## transform Y
bc <- boxcox(Y ~ 1, data = df, lambda = seq(-2, 2, len = 1000))

## 95% CI for lambda
range(bc$x[bc$y > max(bc$y) - 1/2 * qchisq(0.95,1)])
## [1] -0.16616617  0.01401401
lambda_Y <- bc$x[which.max(bc$y)]
df$Y_bc <- (df$Y^lambda_Y - 1) / lambda_Y

plot(df$X_bc, df$Y_bc)

qqnorm(lm(Y_bc ~ X_bc, df)$residuals)
qqline(lm(Y_bc ~ X_bc, df)$residuals)

plot(log(df$X), log(df$Y))

qqnorm(lm(log(Y) ~ log(X), df)$residuals)
qqline(lm(log(Y) ~ log(X), df)$residuals)
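
Both 95% intervals for \(\lambda\) above include zero, which is why a plain log transformation is tried alongside the Box-Cox one. The following sketch illustrates the link between the two: for \(\lambda\) approaching zero, \((v^\lambda - 1) / \lambda\) approaches \(\log(v)\). The helper name `box_cox` and the grid of \(\lambda\) values are assumptions made for illustration.

```r
# Box-Cox transform for a fixed lambda; lambda = 0 is defined as log(v).
# `box_cox` is a hypothetical helper, not a function from MASS.
box_cox <- function(v, lambda) {
  if (abs(lambda) < 1e-8) log(v) else (v^lambda - 1) / lambda
}

# Minimum, median, and maximum of GrLivArea, taken from the quantile output above.
v <- c(334, 1464, 5642)

# Largest absolute gap between the Box-Cox and log transforms as lambda shrinks.
lambdas <- c(0.5, 0.1, 0.01, 0.001)
gap <- sapply(lambdas, function(l) max(abs(box_cox(v, l) - log(v))))
```

The gap shrinks steadily as \(\lambda\) decreases, so when the likelihood-maximizing \(\lambda\) is close to zero the two transformations should give very similar results.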

library(psychometric)
## Loading required package: multilevel
## Loading required package: nlme
(r_bc <- cor(df$X_bc, df$Y_bc))
## [1] 0.7293698
z_r_bc <- 0.5 * log((1 + r_bc)/(1 - r_bc))
se_r <- 1 / sqrt(nrow(df) - 3)
(CIr_bc <- data.frame(lower = (exp(2 * (z_r_bc - qnorm(0.995) * se_r)) - 1) / 
                        (exp(2 * (z_r_bc - qnorm(0.995) * se_r)) + 1),
                      upper = (exp(2 * (z_r_bc + qnorm(0.995) * se_r)) - 1) / 
                        (exp(2 * (z_r_bc + qnorm(0.995) * se_r)) + 1)
                      ))
##       lower     upper
## 1 0.6962049 0.7594276
CIr(r_bc, nrow(df), level = 0.99)
## [1] 0.6962049 0.7594276
(r_ln <- cor(log(df$X), log(df$Y)))
## [1] 0.7302549
z_r_ln <- 0.5 * log((1 + r_ln)/(1 - r_ln))
se_r <- 1 / sqrt(nrow(df) - 3)
(CIr_ln <- data.frame(lower = (exp(2 * (z_r_ln - qnorm(0.995) * se_r)) - 1) / 
                        (exp(2 * (z_r_ln - qnorm(0.995) * se_r)) + 1),
                      upper = (exp(2 * (z_r_ln + qnorm(0.995) * se_r)) - 1) / 
                        (exp(2 * (z_r_ln + qnorm(0.995) * se_r)) + 1)
                      ))
##       lower    upper
## 1 0.6971794 0.760228
CIr(r_ln, nrow(df), level = 0.99)
## [1] 0.6971794 0.7602280
## permutation test on Box-Cox transformed variables
cor_coefs <- vector("numeric", 10000)
for (i in 1:10000) {
  Y_prime <- sample(df$Y_bc, length(df$Y_bc), replace = FALSE)
  cor_coefs[i] <- cor(df$X_bc, Y_prime)
}

head(sort(round(cor_coefs, digits = 3), decreasing = TRUE))
## [1] 0.105 0.091 0.086 0.082 0.082 0.081
(p_val <- sum(abs(cor_coefs) > abs(r_bc)) / length(cor_coefs)) 
## [1] 0
## 99% CI - bootstrap method on Box-Cox transformed variables
cor_coefs <- vector("numeric", 10000)
for (i in 1:10000) {
  rows <- sample(1:nrow(df), nrow(df), replace = TRUE)
  cor_coefs[i] <- cor(df[rows, ]$X_bc, df[rows, ]$Y_bc)
}
quantile(cor_coefs, c(0.005, 0.995))
##      0.5%     99.5% 
## 0.6929419 0.7638510
## permutation test on log transformed variables
cor_coefs <- vector("numeric", 10000)
for (i in 1:10000) {
  Y_prime <- sample(log(df$Y), length(log(df$Y)), replace = FALSE)
  cor_coefs[i] <- cor(log(df$X), Y_prime)
}

head(sort(round(cor_coefs, digits = 3), decreasing = TRUE))
## [1] 0.097 0.089 0.088 0.084 0.083 0.082
(p_val <- sum(abs(cor_coefs) > abs(r_ln)) / length(cor_coefs)) 
## [1] 0
## 99% CI - bootstrap method on log transformed variables
cor_coefs <- vector("numeric", 10000)
for (i in 1:10000) {
  rows <- sample(1:nrow(df), nrow(df), replace = TRUE)
  cor_coefs[i] <- cor(log(df[rows, ]$X), log(df[rows, ]$Y))
}
quantile(cor_coefs, c(0.005, 0.995))
##      0.5%     99.5% 
## 0.6933705 0.7637877

After performing Box-Cox transformations on both \(X\) and \(Y\) using the values of the parameter \(\lambda\) that maximize the log-likelihood, and also performing simple log transformations on both variables since the 95% confidence interval for \(\lambda\) included zero in each case, I computed the correlation and associated 99% confidence interval for each pair of transformed variables. Then, I tested the null hypothesis that the true correlation coefficient \(\rho\) is equal to zero against the alternative hypothesis that \(\rho\) is not equal to zero using a permutation test. Here, new sets of paired values \((x_i, y_{i'})\) were derived from the original set of paired values \((x_i, y_i)\) by randomly sampling the \(y\) values without replacement, and the correlation of the permuted value pairs was calculated. This process was repeated 10,000 times, and the p-value for a two-sided test of the null hypothesis \(\rho = 0\) was calculated as the proportion of the 10,000 permuted correlation coefficients whose absolute value exceeded the absolute value of the correlation coefficient obtained from the original dataset. None did (the largest positive permuted coefficients were about 0.1), so the estimated p-value was zero, that is, below the 1/10,000 resolution of the test. I also applied the bootstrap method to approximate the sampling distribution of the sample correlation coefficient and compute a 99% confidence interval for \(\rho\). Here, I resampled with replacement the same number of paired values as contained in the original dataset and then calculated the correlation coefficient of the resampled data. This process was also iterated 10,000 times, and the 0.5th and 99.5th percentiles of the resulting distribution of resampled correlation coefficients were taken as the interval. The lower boundary of the 99% confidence interval was approximately 0.69, well above zero, supporting the conclusion of the permutation test.
In addition, the 99% confidence interval obtained through bootstrap sampling agreed closely with the confidence interval estimated earlier using the Fisher transformation. Very similar results were obtained for the Box-Cox transformed and log-transformed variable pairs: each pair provided strong evidence against the null hypothesis that the true correlation coefficient is zero.
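
The three procedures described above can be sketched compactly on simulated data. This is an illustration under stated assumptions rather than a rerun of the analysis: the data are simulated, the seed and the 2,000 replicates are arbitrary, and the permutation p-value adds one to the numerator and denominator (a common variant that keeps the estimate from being exactly zero), which differs from the plain proportion used above.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 200
x <- rnorm(n)
y <- 0.7 * x + rnorm(n, sd = 0.7)  # simulated stand-in for a transformed (X, Y) pair

r_obs <- cor(x, y)
B <- 2000                          # number of resamples (assumed; the text uses 10,000)

# Permutation test: shuffling y breaks the pairing, simulating rho = 0.
r_perm <- replicate(B, cor(x, sample(y)))
p_perm <- (sum(abs(r_perm) >= abs(r_obs)) + 1) / (B + 1)

# Percentile bootstrap: resample (x, y) pairs with replacement.
r_boot <- replicate(B, {
  idx <- sample.int(n, n, replace = TRUE)
  cor(x[idx], y[idx])
})
ci_boot <- quantile(r_boot, c(0.005, 0.995))

# Fisher z interval: atanh() is the Fisher transformation, tanh() its inverse.
ci_fisher <- tanh(atanh(r_obs) + c(-1, 1) * qnorm(0.995) / sqrt(n - 3))
```

Using `atanh` and `tanh` is algebraically the same as the explicit `0.5 * log((1 + r) / (1 - r))` and `(exp(2z) - 1) / (exp(2z) + 1)` expressions used earlier, just shorter to write.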

Linear Algebra and Correlation

(cor_mat <- cor(data.frame(X_bc = df$X_bc, Y_bc = df$Y_bc)))
##           X_bc      Y_bc
## X_bc 1.0000000 0.7293698
## Y_bc 0.7293698 1.0000000
(cor_inv <- solve(cor_mat))
##           X_bc      Y_bc
## X_bc  2.136662 -1.558417
## Y_bc -1.558417  2.136662
cor_mat %*% cor_inv
##      X_bc Y_bc
## X_bc    1    0
## Y_bc    0    1
cor_inv %*% cor_mat
##      X_bc Y_bc
## X_bc    1    0
## Y_bc    0    1
cor_mat %*% cor_inv == cor_inv %*% cor_mat
##      X_bc Y_bc
## X_bc TRUE TRUE
## Y_bc TRUE TRUE

Calculus-Based Probability and Statistics

min(df$X) > 0
## [1] TRUE
(nrml_fit <- fitdistr(df$X, densfun = "normal"))
##       mean          sd    
##   1515.46370    525.30039 
##  (  13.74774) (   9.72112)
qqnorm(df$X)
qqline(df$X)

h <- hist(df$X)

rnd <- rnorm(1000, mean = nrml_fit$estimate[1], 
             sd = nrml_fit$estimate[2])
par(mfrow = c(1, 2))
plot(h)
hist(rnd,
     main = paste0("Histogram of 1000 samples", "\n", 
                   "from fitted normal", "\n", 
                   " density function"),
     xlab = paste0("Random samples from", "\n", "N(", 
                   round(nrml_fit$estimate[1], digits = 2), ", ",
                   round(nrml_fit$estimate[2], digits = 2), ")"),
     xlim = c(min(c(h$breaks, min(rnd))), max(h$breaks)))

par(mfrow = c(1, 1))

(lognrml_fit <- fitdistr(df$X, densfun = "log-normal"))
##      meanlog        sdlog   
##   7.267774383   0.333436175 
##  (0.008726424) (0.006170513)
qqnorm(log(df$X))
qqline(log(df$X))

rnd <- exp(rnorm(1000, mean = lognrml_fit$estimate[1], 
                 sd = lognrml_fit$estimate[2]))
par(mfrow = c(1, 2))
plot(h)
hist(rnd,
     main = paste0("Histogram of 1000 samples", "\n", 
                   "from fitted log-normal", "\n",
                   "density function"),
     xlab = paste0("Random samples from", "\n", "exp(N(", 
                   round(lognrml_fit$estimate[1], digits = 2), ", ",
                   round(lognrml_fit$estimate[2], digits = 2), "))"),
     xlim = c(min(c(h$breaks, min(rnd))), max(h$breaks)))

par(mfrow = c(1, 1))

Using the fitdistr function from the MASS package, I fit both normal and, informed by the work above, log-normal density functions to the independent variable \(X\). Comparison of histograms of the original, untransformed variable and of 1000 samples generated from each of the fitted density functions indicates that, while both fits approximate the center of the distribution of the original variable well, the log-normal fit does a much better job of capturing the positive (right) skew of the original data.
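
The visual comparison can be complemented by a numeric one. The sketch below compares the maximized log-likelihoods of a normal and a log-normal fit through AIC, using base R only. It runs on simulated data whose parameters are rounded from the log-normal fit above, so the numbers it produces are illustrative and are not results for GrLivArea; `mle_sd` is a hypothetical helper name.

```r
set.seed(2)  # arbitrary seed, for reproducibility only
# Simulated right-skewed sample; meanlog and sdlog are rounded from the fit above.
x <- rlnorm(500, meanlog = 7.27, sdlog = 0.33)

# Maximum-likelihood standard deviation (divides by n, not n - 1).
mle_sd <- function(v) sqrt(mean((v - mean(v))^2))

# Maximized log-likelihood under each candidate distribution.
ll_norm  <- sum(dnorm(x, mean = mean(x), sd = mle_sd(x), log = TRUE))
ll_lnorm <- sum(dlnorm(x, meanlog = mean(log(x)), sdlog = mle_sd(log(x)),
                       log = TRUE))

# AIC = -2 * log-likelihood + 2 * (number of parameters); lower is better.
aic <- c(normal = -2 * ll_norm + 2 * 2, lognormal = -2 * ll_lnorm + 2 * 2)
```

The division by n in `mle_sd` mirrors what fitdistr reports: its normal sd estimate above (525.30) is slightly smaller than the output of sd() (525.48), which divides by n - 1.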

Modeling

Exploratory analysis & visualization

train <- read.csv("train.csv", header = TRUE)
test <- read.csv("test.csv", header = TRUE)

str(train)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
sapply(train, summary)
## $Id
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   365.8   730.5   730.5  1095.0  1460.0 
## 
## $MSSubClass
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    20.0    50.0    56.9    70.0   190.0 
## 
## $MSZoning
## C (all)      FV      RH      RL      RM 
##      10      65      16    1151     218 
## 
## $LotFrontage
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.00   59.00   69.00   70.05   80.00  313.00     259 
## 
## $LotArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10520   11600  215200 
## 
## $Street
## Grvl Pave 
##    6 1454 
## 
## $Alley
## Grvl Pave NA's 
##   50   41 1369 
## 
## $LotShape
## IR1 IR2 IR3 Reg 
## 484  41  10 925 
## 
## $LandContour
##  Bnk  HLS  Low  Lvl 
##   63   50   36 1311 
## 
## $Utilities
## AllPub NoSeWa 
##   1459      1 
## 
## $LotConfig
##  Corner CulDSac     FR2     FR3  Inside 
##     263      94      47       4    1052 
## 
## $LandSlope
##  Gtl  Mod  Sev 
## 1382   65   13 
## 
## $Neighborhood
## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert 
##      17       2      16      58      28     150      51     100      79 
##  IDOTRR MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes OldTown 
##      37      17      49     225      41       9      77      73     113 
##  Sawyer SawyerW Somerst StoneBr   SWISU  Timber Veenker 
##      74      59      86      25      25      38      11 
## 
## $Condition1
## Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNe   RRNn 
##     48     81   1260      8     19     11     26      2      5 
## 
## $Condition2
## Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNn 
##      2      6   1445      1      2      1      1      2 
## 
## $BldgType
##   1Fam 2fmCon Duplex  Twnhs TwnhsE 
##   1220     31     52     43    114 
## 
## $HouseStyle
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer   SLvl 
##    154     14    726      8     11    445     37     65 
## 
## $OverallQual
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   6.099   7.000  10.000 
## 
## $OverallCond
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   5.000   5.575   6.000   9.000 
## 
## $YearBuilt
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1872    1954    1973    1971    2000    2010 
## 
## $YearRemodAdd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1950    1967    1994    1985    2004    2010 
## 
## $RoofStyle
##    Flat   Gable Gambrel     Hip Mansard    Shed 
##      13    1141      11     286       7       2 
## 
## $RoofMatl
## ClyTile CompShg Membran   Metal    Roll Tar&Grv WdShake WdShngl 
##       1    1434       1       1       1      11       5       6 
## 
## $Exterior1st
## AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd 
##      20       1       2      50       1      61     222       1     220 
## Plywood   Stone  Stucco VinylSd Wd Sdng WdShing 
##     108       2      25     515     206      26 
## 
## $Exterior2nd
## AsbShng AsphShn Brk Cmn BrkFace  CBlock CmentBd HdBoard ImStucc MetalSd 
##      20       3       7      25       1      60     207      10     214 
##   Other Plywood   Stone  Stucco VinylSd Wd Sdng Wd Shng 
##       1     142       5      26     504     197      38 
## 
## $MasVnrType
##  BrkCmn BrkFace    None   Stone    NA's 
##      15     445     864     128       8 
## 
## $MasVnrArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0     0.0   103.7   166.0  1600.0       8 
## 
## $ExterQual
##  Ex  Fa  Gd  TA 
##  52  14 488 906 
## 
## $ExterCond
##   Ex   Fa   Gd   Po   TA 
##    3   28  146    1 1282 
## 
## $Foundation
## BrkTil CBlock  PConc   Slab  Stone   Wood 
##    146    634    647     24      6      3 
## 
## $BsmtQual
##   Ex   Fa   Gd   TA NA's 
##  121   35  618  649   37 
## 
## $BsmtCond
##   Fa   Gd   Po   TA NA's 
##   45   65    2 1311   37 
## 
## $BsmtExposure
##   Av   Gd   Mn   No NA's 
##  221  134  114  953   38 
## 
## $BsmtFinType1
##  ALQ  BLQ  GLQ  LwQ  Rec  Unf NA's 
##  220  148  418   74  133  430   37 
## 
## $BsmtFinSF1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0   383.5   443.6   712.2  5644.0 
## 
## $BsmtFinType2
##  ALQ  BLQ  GLQ  LwQ  Rec  Unf NA's 
##   19   33   14   46   54 1256   38 
## 
## $BsmtFinSF2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   46.55    0.00 1474.00 
## 
## $BsmtUnfSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   223.0   477.5   567.2   808.0  2336.0 
## 
## $TotalBsmtSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.0  1298.0  6110.0 
## 
## $Heating
## Floor  GasA  GasW  Grav  OthW  Wall 
##     1  1428    18     7     2     4 
## 
## $HeatingQC
##  Ex  Fa  Gd  Po  TA 
## 741  49 241   1 428 
## 
## $CentralAir
##    N    Y 
##   95 1365 
## 
## $Electrical
## FuseA FuseF FuseP   Mix SBrkr  NA's 
##    94    27     3     1  1334     1 
## 
## $X1stFlrSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334     882    1087    1163    1391    4692 
## 
## $X2ndFlrSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0     347     728    2065 
## 
## $LowQualFinSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   5.845   0.000 572.000 
## 
## $GrLivArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642 
## 
## $BsmtFullBath
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4253  1.0000  3.0000 
## 
## $BsmtHalfBath
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05753 0.00000 2.00000 
## 
## $FullBath
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.565   2.000   3.000 
## 
## $HalfBath
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3829  1.0000  2.0000 
## 
## $BedroomAbvGr
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   2.866   3.000   8.000 
## 
## $KitchenAbvGr
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.047   1.000   3.000 
## 
## $KitchenQual
##  Ex  Fa  Gd  TA 
## 100  39 586 735 
## 
## $TotRmsAbvGrd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   5.000   6.000   6.518   7.000  14.000 
## 
## $Functional
## Maj1 Maj2 Min1 Min2  Mod  Sev  Typ 
##   14    5   31   34   15    1 1360 
## 
## $Fireplaces
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   0.613   1.000   3.000 
## 
## $FireplaceQu
##   Ex   Fa   Gd   Po   TA NA's 
##   24   33  380   20  313  690 
## 
## $GarageType
##  2Types  Attchd Basment BuiltIn CarPort  Detchd    NA's 
##       6     870      19      88       9     387      81 
## 
## $GarageYrBlt
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1900    1961    1980    1979    2002    2010      81 
## 
## $GarageFinish
##  Fin  RFn  Unf NA's 
##  352  422  605   81 
## 
## $GarageCars
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.767   2.000   4.000 
## 
## $GarageArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   334.5   480.0   473.0   576.0  1418.0 
## 
## $GarageQual
##   Ex   Fa   Gd   Po   TA NA's 
##    3   48   14    3 1311   81 
## 
## $GarageCond
##   Ex   Fa   Gd   Po   TA NA's 
##    2   35    9    7 1326   81 
## 
## $PavedDrive
##    N    P    Y 
##   90   30 1340 
## 
## $WoodDeckSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   94.24  168.00  857.00 
## 
## $OpenPorchSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   25.00   46.66   68.00  547.00 
## 
## $EnclosedPorch
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   21.95    0.00  552.00 
## 
## $X3SsnPorch
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    3.41    0.00  508.00 
## 
## $ScreenPorch
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   15.06    0.00  480.00 
## 
## $PoolArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.759   0.000 738.000 
## 
## $PoolQC
##   Ex   Fa   Gd NA's 
##    2    2    3 1453 
## 
## $Fence
## GdPrv  GdWo MnPrv  MnWw  NA's 
##    59    54   157    11  1179 
## 
## $MiscFeature
## Gar2 Othr Shed TenC NA's 
##    2    2   49    1 1406 
## 
## $MiscVal
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    43.49     0.00 15500.00 
## 
## $MoSold
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   6.322   8.000  12.000 
## 
## $YrSold
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2006    2007    2008    2008    2009    2010 
## 
## $SaleType
##   COD   Con ConLD ConLI ConLw   CWD   New   Oth    WD 
##    43     2     9     5     5     4   122     3  1267 
## 
## $SaleCondition
## Abnorml AdjLand  Alloca  Family  Normal Partial 
##     101       4      12      20    1198     125 
## 
## $SalePrice
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  130000  163000  180900  214000  755000
sapply(test, summary)
## $Id
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1461    1826    2190    2190    2554    2919 
## 
## $MSSubClass
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   20.00   50.00   57.38   70.00  190.00 
## 
## $MSZoning
## C (all)      FV      RH      RL      RM    NA's 
##      15      74      10    1114     242       4 
## 
## $LotFrontage
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   21.00   58.00   67.00   68.58   80.00  200.00     227 
## 
## $LotArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1470    7391    9399    9819   11520   56600 
## 
## $Street
## Grvl Pave 
##    6 1453 
## 
## $Alley
## Grvl Pave NA's 
##   70   37 1352 
## 
## $LotShape
## IR1 IR2 IR3 Reg 
## 484  35   6 934 
## 
## $LandContour
##  Bnk  HLS  Low  Lvl 
##   54   70   24 1311 
## 
## $Utilities
## AllPub   NA's 
##   1457      2 
## 
## $LotConfig
##  Corner CulDSac     FR2     FR3  Inside 
##     248      82      38      10    1081 
## 
## $LandSlope
##  Gtl  Mod  Sev 
## 1396   60    3 
## 
## $Neighborhood
## Blmngtn Blueste  BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert 
##      11       8      14      50      16     117      52      94      86 
##  IDOTRR MeadowV Mitchel   NAmes NoRidge NPkVill NridgHt  NWAmes OldTown 
##      56      20      65     218      30      14      89      58     126 
##  Sawyer SawyerW Somerst StoneBr   SWISU  Timber Veenker 
##      77      66      96      26      23      34      13 
## 
## $Condition1
## Artery  Feedr   Norm   PosA   PosN   RRAe   RRAn   RRNe   RRNn 
##     44     83   1251     12     20     17     24      4      4 
## 
## $Condition2
## Artery  Feedr   Norm   PosA   PosN 
##      3      7   1444      3      2 
## 
## $BldgType
##   1Fam 2fmCon Duplex  Twnhs TwnhsE 
##   1205     31     57     53    113 
## 
## $HouseStyle
## 1.5Fin 1.5Unf 1Story 2.5Unf 2Story SFoyer   SLvl 
##    160      5    745     13    427     46     63 
## 
## $OverallQual
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   6.079   7.000  10.000 
## 
## $OverallCond
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   5.000   5.554   6.000   9.000 
## 
## $YearBuilt
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1879    1953    1973    1971    2001    2010 
## 
## $YearRemodAdd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1950    1963    1992    1984    2004    2010 
## 
## $RoofStyle
##    Flat   Gable Gambrel     Hip Mansard    Shed 
##       7    1169      11     265       4       3 
## 
## $RoofMatl
## CompShg Tar&Grv WdShake WdShngl 
##    1442      12       4       1 
## 
## $Exterior1st
## AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard MetalSd Plywood 
##      24       1       4      37       1      65     220     230     113 
##  Stucco VinylSd Wd Sdng WdShing    NA's 
##      18     510     205      30       1 
## 
## $Exterior2nd
## AsbShng AsphShn Brk Cmn BrkFace  CBlock CmentBd HdBoard ImStucc MetalSd 
##      18       1      15      22       2      66     199       5     233 
## Plywood   Stone  Stucco VinylSd Wd Sdng Wd Shng    NA's 
##     128       1      21     510     194      43       1 
## 
## $MasVnrType
##  BrkCmn BrkFace    None   Stone    NA's 
##      10     434     878     121      16 
## 
## $MasVnrArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0     0.0   100.7   164.0  1290.0      15 
## 
## $ExterQual
##  Ex  Fa  Gd  TA 
##  55  21 491 892 
## 
## $ExterCond
##   Ex   Fa   Gd   Po   TA 
##    9   39  153    2 1256 
## 
## $Foundation
## BrkTil CBlock  PConc   Slab  Stone   Wood 
##    165    601    661     25      5      2 
## 
## $BsmtQual
##   Ex   Fa   Gd   TA NA's 
##  137   53  591  634   44 
## 
## $BsmtCond
##   Fa   Gd   Po   TA NA's 
##   59   57    3 1295   45 
## 
## $BsmtExposure
##   Av   Gd   Mn   No NA's 
##  197  142  125  951   44 
## 
## $BsmtFinType1
##  ALQ  BLQ  GLQ  LwQ  Rec  Unf NA's 
##  209  121  431   80  155  421   42 
## 
## $BsmtFinSF1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.0   350.5   439.2   753.5  4010.0       1 
## 
## $BsmtFinType2
##  ALQ  BLQ  GLQ  LwQ  Rec  Unf NA's 
##   33   35   20   41   51 1237   42 
## 
## $BsmtFinSF2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00   52.62    0.00 1526.00       1 
## 
## $BsmtUnfSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   219.2   460.0   554.3   797.8  2140.0       1 
## 
## $TotalBsmtSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0     784     988    1046    1305    5095       1 
## 
## $Heating
## GasA GasW Grav Wall 
## 1446    9    2    2 
## 
## $HeatingQC
##  Ex  Fa  Gd  Po  TA 
## 752  43 233   2 429 
## 
## $CentralAir
##    N    Y 
##  101 1358 
## 
## $Electrical
## FuseA FuseF FuseP SBrkr 
##    94    23     5  1337 
## 
## $X1stFlrSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   407.0   873.5  1079.0  1157.0  1382.0  5095.0 
## 
## $X2ndFlrSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0     326     676    1862 
## 
## $LowQualFinSF
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    0.000    0.000    3.544    0.000 1064.000 
## 
## $GrLivArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     407    1118    1432    1486    1721    5095 
## 
## $BsmtFullBath
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.4345  1.0000  3.0000       2 
## 
## $BsmtHalfBath
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.0000  0.0000  0.0652  0.0000  2.0000       2 
## 
## $FullBath
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   1.571   2.000   4.000 
## 
## $HalfBath
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3777  1.0000  2.0000 
## 
## $BedroomAbvGr
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   2.854   3.000   6.000 
## 
## $KitchenAbvGr
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.042   1.000   2.000 
## 
## $KitchenQual
##   Ex   Fa   Gd   TA NA's 
##  105   31  565  757    1 
## 
## $TotRmsAbvGrd
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   6.385   7.000  15.000 
## 
## $Functional
## Maj1 Maj2 Min1 Min2  Mod  Sev  Typ NA's 
##    5    4   34   36   20    1 1357    2 
## 
## $Fireplaces
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5812  1.0000  4.0000 
## 
## $FireplaceQu
##   Ex   Fa   Gd   Po   TA NA's 
##   19   41  364   26  279  730 
## 
## $GarageType
##  2Types  Attchd Basment BuiltIn CarPort  Detchd    NA's 
##      17     853      17      98       6     392      76 
## 
## $GarageYrBlt
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1895    1959    1979    1978    2002    2207      78 
## 
## $GarageFinish
##  Fin  RFn  Unf NA's 
##  367  389  625   78 
## 
## $GarageCars
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.000   2.000   1.766   2.000   5.000       1 
## 
## $GarageArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0   318.0   480.0   472.8   576.0  1488.0       1 
## 
## $GarageQual
##   Fa   Gd   Po   TA NA's 
##   76   10    2 1293   78 
## 
## $GarageCond
##   Ex   Fa   Gd   Po   TA NA's 
##    1   39    6    7 1328   78 
## 
## $PavedDrive
##    N    P    Y 
##  126   32 1301 
## 
## $WoodDeckSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   93.17  168.00 1424.00 
## 
## $OpenPorchSF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   28.00   48.31   72.00  742.00 
## 
## $EnclosedPorch
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   24.24    0.00 1012.00 
## 
## $X3SsnPorch
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.794   0.000 360.000 
## 
## $ScreenPorch
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   17.06    0.00  576.00 
## 
## $PoolArea
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.744   0.000 800.000 
## 
## $PoolQC
##   Ex   Gd NA's 
##    2    1 1456 
## 
## $Fence
## GdPrv  GdWo MnPrv  MnWw  NA's 
##    59    58   172     1  1169 
## 
## $MiscFeature
## Gar2 Othr Shed NA's 
##    3    2   46 1408 
## 
## $MiscVal
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.00     0.00     0.00    58.17     0.00 17000.00 
## 
## $MoSold
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   6.000   6.104   8.000  12.000 
## 
## $YrSold
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2006    2007    2008    2008    2009    2010 
## 
## $SaleType
##   COD   Con ConLD ConLI ConLw   CWD   New   Oth    WD  NA's 
##    44     3    17     4     3     8   117     4  1258     1 
## 
## $SaleCondition
## Abnorml AdjLand  Alloca  Family  Normal Partial 
##      89       8      12      26    1204     120
## count missing values in each variable in `train` and `test`
colSums(sapply(train, is.na))[colSums(sapply(train, is.na)) > 0]
##  LotFrontage        Alley   MasVnrType   MasVnrArea     BsmtQual 
##          259         1369            8            8           37 
##     BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2   Electrical 
##           37           38           37           38            1 
##  FireplaceQu   GarageType  GarageYrBlt GarageFinish   GarageQual 
##          690           81           81           81           81 
##   GarageCond       PoolQC        Fence  MiscFeature 
##           81         1453         1179         1406
colSums(sapply(test, is.na))[colSums(sapply(test, is.na)) > 0]
##     MSZoning  LotFrontage        Alley    Utilities  Exterior1st 
##            4          227         1352            2            1 
##  Exterior2nd   MasVnrType   MasVnrArea     BsmtQual     BsmtCond 
##            1           16           15           44           45 
## BsmtExposure BsmtFinType1   BsmtFinSF1 BsmtFinType2   BsmtFinSF2 
##           44           42            1           42            1 
##    BsmtUnfSF  TotalBsmtSF BsmtFullBath BsmtHalfBath  KitchenQual 
##            1            1            2            2            1 
##   Functional  FireplaceQu   GarageType  GarageYrBlt GarageFinish 
##            2          730           76           78           78 
##   GarageCars   GarageArea   GarageQual   GarageCond       PoolQC 
##            1            1           78           78         1456 
##        Fence  MiscFeature     SaleType 
##         1169         1408            1
## check for duplicates
nrow(train) - nrow(unique(train))
## [1] 0
nrow(test) - nrow(unique(test))
## [1] 0
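The row-count difference above only reports whether duplicates exist. A minimal sketch of how base R's `duplicated()` could also locate them, using a hypothetical toy data frame (`toy` is an assumption for illustration, not part of the analysis):

```r
# Hypothetical toy data; not taken from train.csv
toy <- data.frame(id = c(1, 2, 2, 3), price = c(100, 200, 200, 300))

# Same count as nrow(toy) - nrow(unique(toy))
n_dup <- sum(duplicated(toy))

# Row indices of the repeated rows (later occurrences only)
dup_rows <- which(duplicated(toy))

print(n_dup)
print(dup_rows)
```

On `train` and `test` both counts would be zero, matching the output above.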
par(mfrow = c(2, 4))
for(row in 1:10) {
  for (i in 1:4) {
    j <- (row - 1) * 8 + i + 1
    if (j == 80) {break}
    if (is.numeric(train[[j]]) &&
        length(unique(train[[j]])) >= 12) {
      plot(density(train[[j]], na.rm = TRUE),
           main = colnames(train)[j])
    } else {
      barplot(prop.table(table(train[[j]])), 
              main = colnames(train)[j])
    }
  }
}

par(mfrow = c(2, 4))
for(row in 1:10) {
  for (i in 5:8) {
    j <- (row - 1) * 8 + i + 1
    if (j == 80) {break}
    if (is.numeric(train[[j]]) &&
        length(unique(train[[j]])) >= 12) {
      plot(density(train[[j]], na.rm = TRUE),
           main = colnames(train)[j])
    } else {
      barplot(prop.table(table(train[[j]])), 
              main = colnames(train)[j])
    }
  }
}

par(mfrow = c(2, 4))
for(row in 1:10) {
  for (i in 1:8) {
    j <- (row - 1) * 8 + i + 1
    if (j == 80) {break}
    plot(train[[j]], train$SalePrice, main = colnames(train)[j])
  }
}

par(mfrow = c(1, 1))
library(corrplot)
cors <- 
  cor(train[sapply(train, is.numeric) & 
              sapply(train, 
                     function(x) length(unique(x)) >= 5)][, -1],
      use = "na.or.complete")
corrplot(cors, method = "square")

## correlation of each retained numeric variable with SalePrice (column 31)
cors[, "SalePrice"]
##    MSSubClass   LotFrontage       LotArea   OverallQual   OverallCond 
##  -0.088031702   0.344269772   0.299962206   0.797880680  -0.124391232 
##     YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1    BsmtFinSF2 
##   0.525393598   0.521253270   0.488658155   0.390300523  -0.028021366 
##     BsmtUnfSF   TotalBsmtSF     X1stFlrSF     X2ndFlrSF  LowQualFinSF 
##   0.213128680   0.615612237   0.607969106   0.306879002  -0.001481983 
##     GrLivArea  BedroomAbvGr  TotRmsAbvGrd   GarageYrBlt    GarageCars 
##   0.705153567   0.166813894   0.547067360   0.504753018   0.647033611 
##    GarageArea    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##   0.619329622   0.336855121   0.343353812  -0.154843204   0.030776594 
##   ScreenPorch      PoolArea       MiscVal        MoSold        YrSold 
##   0.110426815   0.092488120  -0.036041237   0.051568064  -0.011868823 
##     SalePrice 
##   1.000000000
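
The correlation vector above is easier to scan when sorted by strength. A small sketch, assuming a hypothetical toy data frame (the column names `area`, `age`, and `noise` are made up for illustration), of ranking predictors by absolute correlation with the target:

```r
# Hypothetical toy data; only the column name SalePrice mirrors the analysis
toy <- data.frame(area      = c(1, 2, 3, 4, 6),
                  age       = c(5, 4, 3, 2, 1),
                  noise     = c(1, -1, 0, -1, 1),
                  SalePrice = c(10, 20, 30, 40, 50))

cors_toy <- cor(toy)

# Keep the SalePrice column and drop its correlation with itself
target <- cors_toy[, "SalePrice"]
target <- target[names(target) != "SalePrice"]

# Strongest relationships first, regardless of sign
ranked <- target[order(abs(target), decreasing = TRUE)]
print(round(ranked, 3))
```

Applied to `cors`, the same ordering would put OverallQual (0.798) and GrLivArea (0.705) at the top.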

Data cleaning & pre-processing

kv_class <- 
  data.frame(key = c(20, 30, 40, 45, 50, 60, 70, 75,
               80, 85, 90, 120, 150, 160, 180, 190),
             value = c("1StoryNew", "1StoryOld", 
               "1StoryAttic", "1.5StoryUnf",
               "1.5StoryFin", "2StoryNew",
               "2StoryOld", "2.5Story",
               "SplitLevel", "SplitFoyer",
               "Duplex", "1StoryPUD",
               "1.5StoryPUD", "2StoryPUD",
               "MultiLevelPUD", "TwoFamConvert")
  )

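# Recode MSSubClass to descriptive labels and impute missing values:
# categorical NAs become "None" (feature absent) or the most common level;
# numeric NAs become 0, the median (LotFrontage), or the minimum (GarageYrBlt).
replace_missing <- function(dataset) {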
  df <- dataset
  i <- sapply(df, is.factor)
  df[i] <- lapply(df[i], as.character)
  df$MSSubClass <- 
    sapply(df$MSSubClass, 
           function(x) kv_class[kv_class$key == x, ]$value)
  df$MSZoning[is.na(df$MSZoning)] <- "RL"
  df$LotFrontage[is.na(df$LotFrontage)] <- median(df$LotFrontage, na.rm = TRUE)
  df$Alley[is.na(df$Alley)] <- "None"
  df$Utilities[is.na(df$Utilities)] <- "AllPub"
  df$Exterior1st[is.na(df$Exterior1st)] <- "VinylSd"
  df$Exterior2nd[is.na(df$Exterior2nd)] <- "VinylSd"
  df$MasVnrType[is.na(df$MasVnrType)] <- "None"
  df$MasVnrArea[is.na(df$MasVnrArea)] <- 0
  df$BsmtQual[is.na(df$BsmtQual)] <- "None"
  df$BsmtCond[is.na(df$BsmtCond)] <- "None"
  df$BsmtExposure[is.na(df$BsmtExposure)] <- "None"
  df$BsmtFinType1[is.na(df$BsmtFinType1)] <- "None"
  df$BsmtFinSF1[is.na(df$BsmtFinSF1)] <- 0
  df$BsmtFinType2[is.na(df$BsmtFinType2)] <- "None"
  df$BsmtFinSF2[is.na(df$BsmtFinSF2)] <- 0
  df$BsmtUnfSF[is.na(df$BsmtUnfSF)] <- 0
  df$TotalBsmtSF[is.na(df$TotalBsmtSF)] <- 0
  df$Electrical[is.na(df$Electrical)] <- "SBrkr"
  df$BsmtFullBath[is.na(df$BsmtFullBath)] <- 0
  df$BsmtHalfBath[is.na(df$BsmtHalfBath)] <- 0
  df$KitchenQual[is.na(df$KitchenQual)] <- "TA"
  df$Functional[is.na(df$Functional)] <- "Typ"
  df$FireplaceQu[is.na(df$FireplaceQu)] <- "None"
  df$GarageType[is.na(df$GarageType)] <- "None"
  df$GarageYrBlt[is.na(df$GarageYrBlt)] <- min(df$GarageYrBlt, na.rm = TRUE)
  df$GarageFinish[is.na(df$GarageFinish)] <- "None"
  df$GarageCars[is.na(df$GarageCars)] <- 0
  df$GarageArea[is.na(df$GarageArea)] <- 0
  df$GarageQual[is.na(df$GarageQual)] <- "None"
  df$GarageCond[is.na(df$GarageCond)] <- "None"
  df$PoolQC[is.na(df$PoolQC)] <- "None"
  df$Fence[is.na(df$Fence)] <- "None"
  df$MiscFeature[is.na(df$MiscFeature)] <- "None"
  df$SaleType[is.na(df$SaleType)] <- "WD"
  i <- sapply(df, is.character)
  df[i] <- lapply(df[i], as.factor)
  return(df)
}
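
The `MSSubClass` lookup in `replace_missing()` filters the key-value table once per row. A vectorized alternative, sketched here with `match()` on a hypothetical three-row subset of `kv_class` (`kv_toy` and `codes` are assumptions for illustration), may be easier to read:

```r
# Hypothetical subset of the key-value table; pairs copied from kv_class
kv_toy <- data.frame(key = c(20, 30, 60),
                     value = c("1StoryNew", "1StoryOld", "2StoryNew"),
                     stringsAsFactors = FALSE)

codes <- c(60, 20, 20, 30)

# match() returns the row position of each code in kv_toy$key
labels <- kv_toy$value[match(codes, kv_toy$key)]
print(labels)

# A code that is absent from the table yields NA instead of an empty result
missing_label <- kv_toy$value[match(99, kv_toy$key)]
print(missing_label)
```

An unmatched code then surfaces as an `NA`, which the later missing-value counts would catch.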

kv_bldg_type <- 
  data.frame(key = c("2fmCon", "Duplex", "Twnhs", "TwnhsE", "1Fam"),
             value = 1:5
  )

kv_ext_qual <- 
  data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
             value = 1:5)

kv_ext_cond <- 
  data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
             value = 1:5)

kv_bsmt_qual <- 
  data.frame(key = c("None", "Po", "Fa", "TA", "Gd", "Ex"),
             value = 0:5)

## note: "Po" is ranked below "None" here, unlike in kv_bsmt_qual
kv_bsmt_cond <- 
  data.frame(key = c("Po", "None", "Fa", "TA", "Gd", "Ex"),
             value = 0:5)

kv_bsmt_exp <- 
  data.frame(key = c("None", "No", "Mn", "Av", "Gd"),
             value = 0:4)

kv_heat_qc <- 
  data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
             value = 1:5)

kv_electrical <-
  data.frame(key = c("Mix", "FuseP", "FuseF", "FuseA", "SBrkr"),
             value = 1:5)

kv_kitchen <- 
  data.frame(key = c("Po", "Fa", "TA", "Gd", "Ex"),
             value = 1:5)

## note: "Po" is ranked below "None" here, as in kv_bsmt_cond
kv_fireplace_q <- 
  data.frame(key = c("Po", "None", "Fa", "TA", "Gd", "Ex"),
             value = 0:5)

kv_garage_fin <- 
  data.frame(key = c("None", "Unf", "RFn", "Fin"),
             value = 0:3)

kv_paved_drive <- 
  data.frame(key = c("N", "P", "Y"), value = 1:3)

recode <- function(dataset) {
  # categorical
  df <- dataset
  i <- sapply(df, is.factor)
  df[i] <- lapply(df[i], as.character)
  df$BldgType <- 
    sapply(df$BldgType, 
           function(x) kv_bldg_type[kv_bldg_type$key == x, ]$value)
  df$ExterQual <- 
    sapply(df$ExterQual, 
           function(x) kv_ext_qual[kv_ext_qual$key == x, ]$value)
  df$ExterCond <- 
    sapply(df$ExterCond, 
           function(x) kv_ext_cond[kv_ext_cond$key == x, ]$value)
  df$BsmtQual <- 
    sapply(df$BsmtQual, 
           function(x) kv_bsmt_qual[kv_bsmt_qual$key == x, ]$value)
  df$BsmtCond <- 
    sapply(df$BsmtCond, 
           function(x) kv_bsmt_cond[kv_bsmt_cond$key == x, ]$value)
  df$BsmtExposure <- 
    sapply(df$BsmtExposure, 
           function(x) kv_bsmt_exp[kv_bsmt_exp$key == x, ]$value)
  df$BsmtFinType1 <- 
    ifelse(df$BsmtFinType1 == "GLQ", 2, 
           ifelse(df$BsmtFinType1 == "None", 0, 1))
  df$HeatingQC <- 
    sapply(df$HeatingQC, 
           function(x) kv_heat_qc[kv_heat_qc$key == x, ]$value)
  df$CentralAir <- ifelse(df$CentralAir == "Y", 1, 0)
  df$Electrical <- 
    sapply(df$Electrical,
           function(x) kv_electrical[kv_electrical$key == x, ]$value)
  df$KitchenQual <-
    sapply(df$KitchenQual,
           function(x) kv_kitchen[kv_kitchen$key == x, ]$value)
  df$FireplaceQu <-
    sapply(df$FireplaceQu,
           function(x) kv_fireplace_q[kv_fireplace_q$key == x, ]$value)
  df$GarageType <- 
    ifelse(df$GarageType %in% 
             c("2Types", "Attchd", "Basment", "BuiltIn"), 1, 0)
  df$GarageFinish <-
    sapply(df$GarageFinish,
           function(x) kv_garage_fin[kv_garage_fin$key == x, ]$value)
  df$GarageQual <- 
    ifelse(df$GarageQual %in% c("Ex", "Gd", "TA"), 1, 0)
  df$GarageCond <- 
    ifelse(df$GarageCond %in% c("Ex", "Gd", "TA"), 1, 0)
  df$PavedDrive <-
    sapply(df$PavedDrive,
           function(x) kv_paved_drive[kv_paved_drive$key == x, ]$value)
  df$PoolQC <- ifelse(df$PoolQC == "Ex", 1, 0)
  df$MoSold <- sapply(df$MoSold, function(x) month.name[x])
  i <- sapply(df, is.character)
  df[i] <- lapply(df[i], as.factor)
  
  # binary coding
  df$MasVnrArea <- ifelse(df$MasVnrArea > 0, 1, 0)
  df$MiscVal <- ifelse(df$MiscVal > 0, 1, 0)
  df$X3SsnPorch <- ifelse(df$X3SsnPorch > 0, 1, 0)
  df$ScreenPorch <- ifelse(df$ScreenPorch > 0, 1, 0)
  # reversed coding: 1 = no low-quality finished area, 0 = some
  df$LowQualFinSF <- ifelse(df$LowQualFinSF > 0, 0, 1)
  
  # log transform
  df$LotArea <- log(df$LotArea)
  df$GrLivArea <- log(df$GrLivArea)
  return(df)
}
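
For the quality scales that map straight to 1:5 (as in `kv_ext_qual`), an ordered set of factor levels gives the same integer codes without a lookup table. A minimal sketch, where `qual` is a hypothetical input vector; the tables that start at 0 or reorder "None" would need their own level vectors:

```r
# Hypothetical input; level order copied from kv_ext_qual
qual <- c("TA", "Gd", "Ex", "Fa", "TA")
qual_levels <- c("Po", "Fa", "TA", "Gd", "Ex")

# as.integer() on a factor returns the position of each value in `levels`
qual_num <- as.integer(factor(qual, levels = qual_levels))
print(qual_num)
```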

# Drop rows with extreme basement or first-floor area (applied to train only)
drop_outliers <- function(dataset) {
  df <- dataset
  df <- df[df$BsmtFinSF1 < 5000, ]
  df <- df[df$X1stFlrSF < 4000, ]
  return(df)
}
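
In this analysis the filter runs after `replace_missing()`, so `BsmtFinSF1` has no missing values by then. As a general caution, a logical filter on a column that still contains `NA` keeps an all-`NA` row; the sketch below, on a hypothetical toy data frame, shows how wrapping the condition in `which()` avoids that:

```r
# Hypothetical toy data with one missing area value
toy <- data.frame(sf = c(800, 1200, NA, 4700), price = c(1, 2, 3, 4))

# NA < 4000 is NA, and indexing by NA returns a row filled with NA
naive <- toy[toy$sf < 4000, ]

# which() keeps only the positions where the condition is TRUE
safe <- toy[which(toy$sf < 4000), ]

print(nrow(naive))
print(nrow(safe))
```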
library(caret)
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:psychometric':
## 
##     alpha
train <- replace_missing(train)
test <- replace_missing(test)

## sorted unique levels of every factor column, in `train` and in `test`
fact_cols <- colnames(train)[sapply(train, is.factor)]
train_facts <- sapply(train[fact_cols], 
                      function(x) sort(unique(x[!is.na(x)])))
test_facts <- sapply(test[fact_cols], 
                     function(x) sort(unique(x[!is.na(x)])))

for (i in seq_along(test_facts)) {
  if (length(setdiff(test_facts[[i]], train_facts[[i]])) > 0) {
    print(names(test_facts)[i])
  }
}
## [1] "MSSubClass"
unique(train$MSSubClass)
##  [1] 2StoryNew     1StoryNew     2StoryOld     1.5StoryFin   TwoFamConvert
##  [6] 1.5StoryUnf   Duplex        1StoryPUD     1StoryOld     SplitFoyer   
## [11] SplitLevel    2StoryPUD     2.5Story      MultiLevelPUD 1StoryAttic  
## 16 Levels: 1.5StoryFin 1.5StoryPUD 1.5StoryUnf 1StoryAttic ... TwoFamConvert
unique(test$MSSubClass)
##  [1] 1StoryNew     2StoryNew     1StoryPUD     2StoryPUD     SplitLevel   
##  [6] 1StoryOld     1.5StoryFin   Duplex        SplitFoyer    TwoFamConvert
## [11] 1.5StoryUnf   2StoryOld     2.5Story      MultiLevelPUD 1StoryAttic  
## [16] 1.5StoryPUD  
## 16 Levels: 1.5StoryFin 1.5StoryPUD 1.5StoryUnf 1StoryAttic ... TwoFamConvert
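
The two listings differ in one value: `1.5StoryPUD` occurs in `test` but not in `train`, which is why the loop printed `MSSubClass`. A sketch of returning the unseen levels themselves rather than only the column name, using hypothetical toy data frames (`train_toy`, `test_toy`, and their columns are assumptions for illustration):

```r
# Hypothetical toy data: level "c" of `cls` appears only in the test set
train_toy <- data.frame(cls  = factor(c("a", "b", "a")),
                        zone = factor(c("x", "y", "x")))
test_toy  <- data.frame(cls  = factor(c("a", "c", "b")),
                        zone = factor(c("y", "x", "x")))

# Compare the observed values column by column
unseen <- Map(function(te, tr) {
  setdiff(as.character(unique(te)), as.character(unique(tr)))
}, test_toy, train_toy)

# Keep only the columns that have at least one unseen level
unseen <- unseen[lengths(unseen) > 0]
print(unseen)
```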
train <- recode(train)
test <- recode(test)

train <- drop_outliers(train)

colSums(sapply(train, is.na))[colSums(sapply(train, is.na)) > 0]
## named numeric(0)
colSums(sapply(test, is.na))[colSums(sapply(test, is.na)) > 0]
## named numeric(0)
## dummy code categorical variables in `train` and `test` datasets 
dummies <- dummyVars("~ .", data = rbind(train[, -ncol(train)], test))
SalePrice <- data.frame(Id = train$Id, SalePrice = train$SalePrice)
train <- as.data.frame(predict(dummies, newdata = train))
train <- merge(train, SalePrice, by = "Id")
test <- as.data.frame(predict(dummies, newdata = test))
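
Fitting `dummyVars` on the row-bound data means `train` and `test` get the same dummy columns, even for levels that only one of them contains. The sketch below illustrates the idea with base R's `model.matrix()` as a stand-in (it is not the caret call used above, and `lv`, `tr`, and `te` are hypothetical):

```r
# Hypothetical shared level set, declared once for both data sets
lv <- c("Grvl", "Pave", "None")
tr <- factor(c("Grvl", "None"), levels = lv)  # "Pave" never occurs here
te <- factor(c("Pave", "None"), levels = lv)

# One indicator column per level, including levels with no observations
mm_tr <- model.matrix(~ x - 1, data.frame(x = tr))
mm_te <- model.matrix(~ x - 1, data.frame(x = te))

print(colnames(mm_tr))
print(identical(colnames(mm_tr), colnames(mm_te)))
```

The cost is an all-zero column for any level missing from one data set, so such columns would be expected to show up as zero-variance in a subsequent `nearZeroVar` check.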

(nzv <- nearZeroVar(train, saveMetrics = TRUE))
##                            freqRatio percentUnique zeroVar   nzv
## Id                          1.000000   100.0000000   FALSE FALSE
## MSSubClass.1.5StoryFin      9.131944     0.1370802   FALSE FALSE
## MSSubClass.1.5StoryUnf    120.583333     0.1370802   FALSE  TRUE
## MSSubClass.1StoryAttic    363.750000     0.1370802   FALSE  TRUE
## MSSubClass.1StoryNew        1.722015     0.1370802   FALSE FALSE
## MSSubClass.1StoryOld       20.144928     0.1370802   FALSE  TRUE
## MSSubClass.1StoryPUD       15.770115     0.1370802   FALSE FALSE
## MSSubClass.2.5Story        90.187500     0.1370802   FALSE  TRUE
## MSSubClass.2StoryNew        3.895973     0.1370802   FALSE FALSE
## MSSubClass.2StoryOld       23.316667     0.1370802   FALSE  TRUE
## MSSubClass.2StoryPUD       22.158730     0.1370802   FALSE  TRUE
## MSSubClass.Duplex          27.057692     0.1370802   FALSE  TRUE
## MSSubClass.MultiLevelPUD  144.900000     0.1370802   FALSE  TRUE
## MSSubClass.SplitFoyer      71.950000     0.1370802   FALSE  TRUE
## MSSubClass.SplitLevel      24.155172     0.1370802   FALSE  TRUE
## MSSubClass.TwoFamConvert   47.633333     0.1370802   FALSE  TRUE
## MSSubClass.1.5StoryPUD      0.000000     0.0685401    TRUE  TRUE
## MSZoning.C (all)          144.900000     0.1370802   FALSE  TRUE
## MSZoning.FV                21.446154     0.1370802   FALSE  TRUE
## MSZoning.RH                90.187500     0.1370802   FALSE  TRUE
## MSZoning.RL                 3.721683     0.1370802   FALSE FALSE
## MSZoning.RM                 5.692661     0.1370802   FALSE FALSE
## LotFrontage                 1.888112     7.5394106   FALSE FALSE
## LotArea                     1.041667    73.4749829   FALSE FALSE
## Street.Grvl               242.166667     0.1370802   FALSE  TRUE
## Street.Pave               242.166667     0.1370802   FALSE  TRUE
## Alley.Grvl                 28.180000     0.1370802   FALSE  TRUE
## Alley.None                 15.032967     0.1370802   FALSE FALSE
## Alley.Pave                 34.585366     0.1370802   FALSE  TRUE
## LotShape.IR1                2.014463     0.1370802   FALSE FALSE
## LotShape.IR2               34.585366     0.1370802   FALSE  TRUE
## LotShape.IR3              161.111111     0.1370802   FALSE  TRUE
## LotShape.Reg                1.732210     0.1370802   FALSE FALSE
## LandContour.Bnk            22.532258     0.1370802   FALSE  TRUE
## LandContour.HLS            28.180000     0.1370802   FALSE  TRUE
## LandContour.Low            39.527778     0.1370802   FALSE  TRUE
## LandContour.Lvl             8.858108     0.1370802   FALSE FALSE
## Utilities.AllPub         1458.000000     0.1370802   FALSE  TRUE
## Utilities.NoSeWa         1458.000000     0.1370802   FALSE  TRUE
## LotConfig.Corner            4.568702     0.1370802   FALSE FALSE
## LotConfig.CulDSac          14.521277     0.1370802   FALSE FALSE
## LotConfig.FR2              30.042553     0.1370802   FALSE  TRUE
## LotConfig.FR3             363.750000     0.1370802   FALSE  TRUE
## LotConfig.Inside            2.584767     0.1370802   FALSE FALSE
## LandSlope.Gtl              17.705128     0.1370802   FALSE FALSE
## LandSlope.Mod              21.446154     0.1370802   FALSE  TRUE
## LandSlope.Sev             111.230769     0.1370802   FALSE  TRUE
## Neighborhood.Blmngtn       84.823529     0.1370802   FALSE  TRUE
## Neighborhood.Blueste      728.500000     0.1370802   FALSE  TRUE
## Neighborhood.BrDale        90.187500     0.1370802   FALSE  TRUE
## Neighborhood.BrkSide       24.155172     0.1370802   FALSE  TRUE
## Neighborhood.ClearCr       51.107143     0.1370802   FALSE  TRUE
## Neighborhood.CollgCr        8.726667     0.1370802   FALSE FALSE
## Neighborhood.Crawfor       27.607843     0.1370802   FALSE  TRUE
## Neighborhood.Edwards       13.737374     0.1370802   FALSE FALSE
## Neighborhood.Gilbert       17.468354     0.1370802   FALSE FALSE
## Neighborhood.IDOTRR        38.432432     0.1370802   FALSE  TRUE
## Neighborhood.MeadowV       84.823529     0.1370802   FALSE  TRUE
## Neighborhood.Mitchel       28.775510     0.1370802   FALSE  TRUE
## Neighborhood.NAmes          5.484444     0.1370802   FALSE FALSE
## Neighborhood.NoRidge       34.585366     0.1370802   FALSE  TRUE
## Neighborhood.NPkVill      161.111111     0.1370802   FALSE  TRUE
## Neighborhood.NridgHt       17.948052     0.1370802   FALSE FALSE
## Neighborhood.NWAmes        18.986301     0.1370802   FALSE FALSE
## Neighborhood.OldTown       11.911504     0.1370802   FALSE FALSE
## Neighborhood.Sawyer        18.716216     0.1370802   FALSE FALSE
## Neighborhood.SawyerW       23.728814     0.1370802   FALSE  TRUE
## Neighborhood.Somerst       15.965116     0.1370802   FALSE FALSE
## Neighborhood.StoneBr       57.360000     0.1370802   FALSE  TRUE
## Neighborhood.SWISU         57.360000     0.1370802   FALSE  TRUE
## Neighborhood.Timber        37.394737     0.1370802   FALSE  TRUE
## Neighborhood.Veenker      131.636364     0.1370802   FALSE  TRUE
## Condition1.Artery          29.395833     0.1370802   FALSE  TRUE
## Condition1.Feedr           17.237500     0.1370802   FALSE FALSE
## Condition1.Norm             6.331658     0.1370802   FALSE FALSE
## Condition1.PosA           181.375000     0.1370802   FALSE  TRUE
## Condition1.PosN            75.789474     0.1370802   FALSE  TRUE
## Condition1.RRAe           131.636364     0.1370802   FALSE  TRUE
## Condition1.RRAn            55.115385     0.1370802   FALSE  TRUE
## Condition1.RRNe           728.500000     0.1370802   FALSE  TRUE
## Condition1.RRNn           290.800000     0.1370802   FALSE  TRUE
## Condition2.Artery         728.500000     0.1370802   FALSE  TRUE
## Condition2.Feedr          242.166667     0.1370802   FALSE  TRUE
## Condition2.Norm            96.266667     0.1370802   FALSE  TRUE
## Condition2.PosA          1458.000000     0.1370802   FALSE  TRUE
## Condition2.PosN           728.500000     0.1370802   FALSE  TRUE
## Condition2.RRAe          1458.000000     0.1370802   FALSE  TRUE
## Condition2.RRAn          1458.000000     0.1370802   FALSE  TRUE
## Condition2.RRNn           728.500000     0.1370802   FALSE  TRUE
## BldgType                   10.692982     0.3427005   FALSE FALSE
## HouseStyle.1.5Fin           8.474026     0.1370802   FALSE FALSE
## HouseStyle.1.5Unf         103.214286     0.1370802   FALSE  TRUE
## HouseStyle.1Story           1.009642     0.1370802   FALSE FALSE
## HouseStyle.2.5Fin         181.375000     0.1370802   FALSE  TRUE
## HouseStyle.2.5Unf         131.636364     0.1370802   FALSE  TRUE
## HouseStyle.2Story           2.286036     0.1370802   FALSE FALSE
## HouseStyle.SFoyer          38.432432     0.1370802   FALSE  TRUE
## HouseStyle.SLvl            21.446154     0.1370802   FALSE  TRUE
## OverallQual                 1.061497     0.6854010   FALSE FALSE
## OverallCond                 3.253968     0.6168609   FALSE FALSE
## YearBuilt                   1.046875     7.6764907   FALSE FALSE
## YearRemodAdd                1.835052     4.1809459   FALSE FALSE
## RoofStyle.Flat            111.230769     0.1370802   FALSE  TRUE
## RoofStyle.Gable             3.588050     0.1370802   FALSE FALSE
## RoofStyle.Gambrel         131.636364     0.1370802   FALSE  TRUE
## RoofStyle.Hip               4.119298     0.1370802   FALSE FALSE
## RoofStyle.Mansard         207.428571     0.1370802   FALSE  TRUE
## RoofStyle.Shed            728.500000     0.1370802   FALSE  TRUE
## RoofMatl.ClyTile            0.000000     0.0685401    TRUE  TRUE
## RoofMatl.CompShg           57.360000     0.1370802   FALSE  TRUE
## RoofMatl.Membran         1458.000000     0.1370802   FALSE  TRUE
## RoofMatl.Metal           1458.000000     0.1370802   FALSE  TRUE
## RoofMatl.Roll            1458.000000     0.1370802   FALSE  TRUE
## RoofMatl.Tar&Grv          131.636364     0.1370802   FALSE  TRUE
## RoofMatl.WdShake          290.800000     0.1370802   FALSE  TRUE
## RoofMatl.WdShngl          242.166667     0.1370802   FALSE  TRUE
## Exterior1st.AsbShng        71.950000     0.1370802   FALSE  TRUE
## Exterior1st.AsphShn      1458.000000     0.1370802   FALSE  TRUE
## Exterior1st.BrkComm       728.500000     0.1370802   FALSE  TRUE
## Exterior1st.BrkFace        28.180000     0.1370802   FALSE  TRUE
## Exterior1st.CBlock       1458.000000     0.1370802   FALSE  TRUE
## Exterior1st.CemntBd        22.918033     0.1370802   FALSE  TRUE
## Exterior1st.HdBoard         5.572072     0.1370802   FALSE FALSE
## Exterior1st.ImStucc      1458.000000     0.1370802   FALSE  TRUE
## Exterior1st.MetalSd         5.631818     0.1370802   FALSE FALSE
## Exterior1st.Plywood        12.509259     0.1370802   FALSE FALSE
## Exterior1st.Stone         728.500000     0.1370802   FALSE  TRUE
## Exterior1st.Stucco         59.791667     0.1370802   FALSE  TRUE
## Exterior1st.VinylSd         1.833010     0.1370802   FALSE FALSE
## Exterior1st.Wd Sdng         6.082524     0.1370802   FALSE FALSE
## Exterior1st.WdShing        55.115385     0.1370802   FALSE  TRUE
## Exterior2nd.AsbShng        71.950000     0.1370802   FALSE  TRUE
## Exterior2nd.AsphShn       485.333333     0.1370802   FALSE  TRUE
## Exterior2nd.Brk Cmn       207.428571     0.1370802   FALSE  TRUE
## Exterior2nd.BrkFace        57.360000     0.1370802   FALSE  TRUE
## Exterior2nd.CBlock       1458.000000     0.1370802   FALSE  TRUE
## Exterior2nd.CmentBd        23.316667     0.1370802   FALSE  TRUE
## Exterior2nd.HdBoard         6.048309     0.1370802   FALSE FALSE
## Exterior2nd.ImStucc       144.900000     0.1370802   FALSE  TRUE
## Exterior2nd.MetalSd         5.817757     0.1370802   FALSE FALSE
## Exterior2nd.Other        1458.000000     0.1370802   FALSE  TRUE
## Exterior2nd.Plywood         9.274648     0.1370802   FALSE FALSE
## Exterior2nd.Stone         290.800000     0.1370802   FALSE  TRUE
## Exterior2nd.Stucco         57.360000     0.1370802   FALSE  TRUE
## Exterior2nd.VinylSd         1.894841     0.1370802   FALSE FALSE
## Exterior2nd.Wd Sdng         6.406091     0.1370802   FALSE FALSE
## Exterior2nd.Wd Shng        37.394737     0.1370802   FALSE  TRUE
## MasVnrType.BrkCmn          96.266667     0.1370802   FALSE  TRUE
## MasVnrType.BrkFace          2.278652     0.1370802   FALSE FALSE
## MasVnrType.None             1.485520     0.1370802   FALSE FALSE
## MasVnrType.Stone           10.488189     0.1370802   FALSE FALSE
## MasVnrArea                  1.472881     0.1370802   FALSE FALSE
## ExterQual                   1.856557     0.2741604   FALSE FALSE
## ExterCond                   8.773973     0.3427005   FALSE FALSE
## Foundation.BrkTil           8.993151     0.1370802   FALSE FALSE
## Foundation.CBlock           1.301262     0.1370802   FALSE FALSE
## Foundation.PConc            1.258514     0.1370802   FALSE FALSE
## Foundation.Slab            59.791667     0.1370802   FALSE  TRUE
## Foundation.Stone          242.166667     0.1370802   FALSE  TRUE
## Foundation.Wood           485.333333     0.1370802   FALSE  TRUE
## BsmtQual                    1.050162     0.3427005   FALSE FALSE
## BsmtCond                   20.153846     0.3427005   FALSE  TRUE
## BsmtExposure                4.312217     0.3427005   FALSE FALSE
## BsmtFinType1                2.410072     0.2056203   FALSE FALSE
## BsmtFinSF1                 38.916667    43.5915010   FALSE FALSE
## BsmtFinType2.ALQ           75.789474     0.1370802   FALSE  TRUE
## BsmtFinType2.BLQ           43.212121     0.1370802   FALSE  TRUE
## BsmtFinType2.GLQ          103.214286     0.1370802   FALSE  TRUE
## BsmtFinType2.LwQ           30.717391     0.1370802   FALSE  TRUE
## BsmtFinType2.None          37.394737     0.1370802   FALSE  TRUE
## BsmtFinType2.Rec           26.018519     0.1370802   FALSE  TRUE
## BsmtFinType2.Unf            6.151961     0.1370802   FALSE FALSE
## BsmtFinSF2                258.400000     9.8697738   FALSE  TRUE
## BsmtUnfSF                  13.111111    53.4612748   FALSE FALSE
## TotalBsmtSF                 1.057143    49.3488691   FALSE FALSE
## Heating.Floor            1458.000000     0.1370802   FALSE  TRUE
## Heating.GasA               44.593750     0.1370802   FALSE  TRUE
## Heating.GasW               80.055556     0.1370802   FALSE  TRUE
## Heating.Grav              207.428571     0.1370802   FALSE  TRUE
## Heating.OthW              728.500000     0.1370802   FALSE  TRUE
## Heating.Wall              363.750000     0.1370802   FALSE  TRUE
## HeatingQC                   1.728972     0.3427005   FALSE FALSE
## CentralAir                 14.357895     0.1370802   FALSE FALSE
## Electrical                 14.191489     0.3427005   FALSE FALSE
## X1stFlrSF                   1.562500    51.5421522   FALSE FALSE
## X2ndFlrSF                  82.900000    28.5126799   FALSE FALSE
## LowQualFinSF               55.115385     0.1370802   FALSE  TRUE
## GrLivArea                   1.571429    58.9444825   FALSE FALSE
## BsmtFullBath                1.455782     0.2741604   FALSE FALSE
## BsmtHalfBath               17.212500     0.2056203   FALSE FALSE
## FullBath                    1.180000     0.2741604   FALSE FALSE
## HalfBath                    1.709738     0.2056203   FALSE FALSE
## BedroomAbvGr                2.243017     0.5483208   FALSE FALSE
## KitchenAbvGr               21.400000     0.2741604   FALSE  TRUE
## KitchenQual                 1.254266     0.2741604   FALSE FALSE
## TotRmsAbvGrd                1.221884     0.8224812   FALSE FALSE
## Functional.Maj1           103.214286     0.1370802   FALSE  TRUE
## Functional.Maj2           290.800000     0.1370802   FALSE  TRUE
## Functional.Min1            46.064516     0.1370802   FALSE  TRUE
## Functional.Min2            41.911765     0.1370802   FALSE  TRUE
## Functional.Mod             96.266667     0.1370802   FALSE  TRUE
## Functional.Sev           1458.000000     0.1370802   FALSE  TRUE
## Functional.Typ             13.590000     0.1370802   FALSE FALSE
## Fireplaces                  1.061538     0.2741604   FALSE FALSE
## FireplaceQu                 1.820580     0.4112406   FALSE FALSE
## GarageType                  2.058700     0.1370802   FALSE FALSE
## GarageYrBlt                 1.261538     6.6483893   FALSE FALSE
## GarageFinish                1.433649     0.2741604   FALSE FALSE
## GarageCars                  2.230352     0.3427005   FALSE FALSE
## GarageArea                  1.653061    30.1576422   FALSE FALSE
## GarageQual                 10.053030     0.1370802   FALSE FALSE
## GarageCond                 10.861789     0.1370802   FALSE FALSE
## PavedDrive                 14.877778     0.2056203   FALSE FALSE
## WoodDeckSF                 20.026316    18.7799863   FALSE FALSE
## OpenPorchSF                22.620690    13.7765593   FALSE FALSE
## EnclosedPorch              83.400000     8.2248115   FALSE  TRUE
## X3SsnPorch                 59.791667     0.1370802   FALSE  TRUE
## ScreenPorch                11.577586     0.1370802   FALSE FALSE
## PoolArea                 1453.000000     0.4797807   FALSE  TRUE
## PoolQC                    728.500000     0.1370802   FALSE  TRUE
## Fence.GdPrv                23.728814     0.1370802   FALSE  TRUE
## Fence.GdWo                 26.018519     0.1370802   FALSE  TRUE
## Fence.MnPrv                 8.292994     0.1370802   FALSE FALSE
## Fence.MnWw                131.636364     0.1370802   FALSE  TRUE
## Fence.None                  4.192171     0.1370802   FALSE FALSE
## MiscFeature.Gar2          728.500000     0.1370802   FALSE  TRUE
## MiscFeature.None           26.018519     0.1370802   FALSE  TRUE
## MiscFeature.Othr          728.500000     0.1370802   FALSE  TRUE
## MiscFeature.Shed           28.775510     0.1370802   FALSE  TRUE
## MiscFeature.TenC         1458.000000     0.1370802   FALSE  TRUE
## MiscVal                    27.057692     0.1370802   FALSE  TRUE
## MoSold.April                9.347518     0.1370802   FALSE FALSE
## MoSold.August              10.959016     0.1370802   FALSE FALSE
## MoSold.December            23.728814     0.1370802   FALSE  TRUE
## MoSold.February            27.057692     0.1370802   FALSE  TRUE
## MoSold.January             24.596491     0.1370802   FALSE  TRUE
## MoSold.July                 5.235043     0.1370802   FALSE FALSE
## MoSold.June                 4.766798     0.1370802   FALSE FALSE
## MoSold.March               12.764151     0.1370802   FALSE FALSE
## MoSold.May                  6.151961     0.1370802   FALSE FALSE
## MoSold.November            17.468354     0.1370802   FALSE FALSE
## MoSold.October             15.393258     0.1370802   FALSE FALSE
## MoSold.September           22.158730     0.1370802   FALSE  TRUE
## YrSold                      1.027356     0.3427005   FALSE FALSE
## SaleType.COD               32.930233     0.1370802   FALSE  TRUE
## SaleType.Con              728.500000     0.1370802   FALSE  TRUE
## SaleType.ConLD            161.111111     0.1370802   FALSE  TRUE
## SaleType.ConLI            290.800000     0.1370802   FALSE  TRUE
## SaleType.ConLw            290.800000     0.1370802   FALSE  TRUE
## SaleType.CWD              363.750000     0.1370802   FALSE  TRUE
## SaleType.New               11.057851     0.1370802   FALSE FALSE
## SaleType.Oth              485.333333     0.1370802   FALSE  TRUE
## SaleType.WD                 6.598958     0.1370802   FALSE FALSE
## SaleCondition.Abnorml      13.445545     0.1370802   FALSE FALSE
## SaleCondition.AdjLand     363.750000     0.1370802   FALSE  TRUE
## SaleCondition.Alloca      120.583333     0.1370802   FALSE  TRUE
## SaleCondition.Family       71.950000     0.1370802   FALSE  TRUE
## SaleCondition.Normal        4.590038     0.1370802   FALSE FALSE
## SaleCondition.Partial      10.766129     0.1370802   FALSE FALSE
## SalePrice                   1.176471    45.4420836   FALSE FALSE
nzv_cols <- row.names(nzv[!grepl("Neighborhood", row.names(nzv)) & nzv$nzv, ])
if(length(nzv_cols) > 0) {
  train <- train[, -which(names(train) %in% nzv_cols)]
}
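
As a rough guide to reading the `freqRatio` and `percentUnique` columns in the table above, the following base R sketch recomputes both metrics for a single toy column. The helper name `nzv_metrics` and the toy vector are hypothetical, and the cutoffs are assumed to mirror the defaults of caret's `nearZeroVar()` (`freqCut = 95/5`, `uniqueCut = 10`); it approximates the idea and is not caret's implementation.

```r
# Hypothetical helper: approximate freqRatio / percentUnique for one column.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # count of the most common value divided by count of the second most common
  # (returning Inf for a constant column is a choice made in this sketch)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # distinct values as a percentage of all rows
  pct_unique <- 100 * length(counts) / length(x)
  data.frame(freqRatio = freq_ratio,
             percentUnique = pct_unique,
             nzv = freq_ratio > freq_cut && pct_unique < unique_cut)
}

# toy dummy column: 1458 zeros and a single one
toy <- c(rep(0, 1458), 1)
nzv_metrics(toy)
```

For this toy column the ratio is 1458 / 1 and the share of unique values is 2 / 1459 * 100, roughly 0.137, which is the same pattern as the `Heating.Floor` row.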

## identify and remove highly correlated predictors from training set
cor_preds <- cor(train[, -which(names(train) == "SalePrice")])
high_cor <- findCorrelation(cor_preds, cutoff = 0.80)
which(colnames(train) %in% 
        c("GrLivArea", "TotalBsmtSF", 
          "GarageCars", "FireplaceQu"))
## [1]  80  86  96 100
high_cor <- high_cor[!high_cor %in% c(80, 86, 96, 100)]
if(length(high_cor) > 0) {
  train <- train[, -high_cor]
}
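
Hard-coded column positions such as 80 or 86 shift whenever an earlier step adds or removes a column, and a negative index built from an empty vector drops every column. A name-based variant is sketched below on toy data. The pairwise rule used here (drop the second member of each highly correlated pair) is a simplification and not the heuristic that `findCorrelation()` applies, and `drop_correlated`, `toy`, and `keep_always` are hypothetical names.

```r
# Hypothetical helper: drop highly correlated predictors by name, never by position.
drop_correlated <- function(df, target, cutoff = 0.80, keep_always = character(0)) {
  cor_mat <- cor(df[, names(df) != target, drop = FALSE])
  # look at each pair once (upper triangle) and flag those above the cutoff
  high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  drop_names <- setdiff(unique(colnames(cor_mat)[high[, "col"]]), keep_always)
  # logical indexing stays safe even when drop_names is empty
  df[, !names(df) %in% drop_names, drop = FALSE]
}

toy <- data.frame(a = 1:5,
                  b = c(2, 4, 6, 8, 11),   # nearly collinear with a
                  c = c(3, 5, 1, 4, 2),
                  SalePrice = c(100, 150, 120, 180, 200))

names(drop_correlated(toy, "SalePrice"))                     # b is dropped
names(drop_correlated(toy, "SalePrice", keep_always = "b"))  # nothing is dropped
```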

Partition labeled training data

## partition training set for model testing on known sale prices
split <- createDataPartition(train$SalePrice, p = 0.75, list = FALSE)

training <- train[split, ]
testing <- train[-split, ]
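
`createDataPartition()` does not take a plain random sample: for a numeric outcome it samples within quantile-based groups so that both subsets cover the range of `SalePrice`. The base R sketch below imitates that idea with quartile bins on simulated prices. The function name `stratified_split`, the simulated data, and the seed are illustrative assumptions, and caret's own grouping differs in detail.

```r
# Hypothetical helper: sample a share p of the rows within each quantile bin of y.
stratified_split <- function(y, p = 0.75, groups = 4) {
  bins <- cut(y,
              breaks = quantile(y, probs = seq(0, 1, length.out = groups + 1)),
              include.lowest = TRUE)
  picked <- lapply(split(seq_along(y), bins), function(i) {
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(2017)  # fixing the seed makes the split reproducible
prices <- exp(rnorm(200, mean = 12, sd = 0.4))  # simulated, not the Ames data
in_train <- stratified_split(prices)
length(in_train) / length(prices)
```

Setting a seed before the partition step is what lets the reported test-set RMSE be reproduced on a later run.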

Look up regression models in caret

mods <- modelLookup()
mods <- mods[mods$forReg == TRUE, ]

Use repeated 10-fold cross-validation for fitting

fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 5)
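
With `number = 10` and `repeats = 5`, each model is fit on 50 resamples, each holding out roughly a tenth of the rows. The following sketch builds such fold assignments by hand in base R for 1096 rows, the size of the `training` subset reported in the model output. It uses plain random folds and a hypothetical seed, so it only approximates the resampling that caret performs.

```r
set.seed(2017)
n <- 1096
number <- 10
repeats <- 5

# one column of fold labels (1..10) per repeat
fold_id <- replicate(repeats, sample(rep(seq_len(number), length.out = n)))

# rows available for fitting in each of the number * repeats resamples
train_sizes <- as.vector(apply(fold_id, 2, function(f) n - tabulate(f, nbins = number)))
length(train_sizes)
range(train_sizes)
```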

Train models

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost)
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-5
library(doParallel)
## Loading required package: iterators
## Loading required package: parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)

## model training and evaluation on partitioned `train` data
err <- data.frame(model = character(0), rmse = numeric(0),
                  stringsAsFactors = FALSE)
# set.seed(2017)
lm_fit <- train(log(SalePrice) ~ . - Id, data = training,
                method = "lm",
                preProc = c("center", "scale"),
                trControl = fitControl)
lm_fit
## Linear Regression 
## 
## 1096 samples
##  110 predictor
## 
## Pre-processing: centered (109), scaled (109) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 987, 985, 986, 986, 988, 985, ... 
## Resampling results:
## 
##   RMSE       Rsquared 
##   0.1265347  0.8994076
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
testing$SalePredict1 <- exp(predict(lm_fit, testing, na.action = na.pass))
RMSE(log(testing$SalePredict1), log(testing$SalePrice))
## [1] 0.1084855
sqrt(sum((log(testing$SalePredict1) - log(testing$SalePrice))^2) /
       nrow(testing))
## [1] 0.1084855
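
The two calls above agree because the RMSE of log prices is simply the square root of the mean squared difference between log predictions and log actual values. A minimal base R version, with a hypothetical function name `rmse_log` and made-up prices, might look like this:

```r
# Hypothetical helper: RMSE computed on the log scale.
rmse_log <- function(pred, obs) {
  sqrt(mean((log(pred) - log(obs))^2))
}

pred <- c(119000, 163000, 180000)  # made-up predictions
obs  <- c(125000, 160000, 171000)  # made-up sale prices
rmse_log(pred, obs)

# rescaling both vectors by the same factor leaves the metric unchanged
all.equal(rmse_log(pred, obs), rmse_log(pred / 1000, obs / 1000))
```

Because the metric depends only on the ratio of predicted to actual price, a given percentage error is penalized equally on inexpensive and expensive homes.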
err[nrow(err) + 1, ] <-
  c("lm",
    RMSE(log(testing$SalePredict1),
         log(testing$SalePrice))
  )
summary(lm_fit)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10024 -0.04895  0.00052  0.05953  0.42364 
## 
## Coefficients: (1 not defined because of singularities)
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 12.0236951  0.0035326 3403.599  < 2e-16 ***
## MSSubClass.1.5StoryFin      -0.0015951  0.0049810   -0.320 0.748849    
## MSSubClass.1StoryNew         0.0150135  0.0093709    1.602 0.109446    
## MSSubClass.1StoryPUD         0.0058897  0.0065264    0.902 0.367038    
## MSSubClass.2StoryNew        -0.0094810  0.0081829   -1.159 0.246886    
## MSZoning.RL                  0.0066192  0.0067510    0.980 0.327092    
## LotFrontage                  0.0010386  0.0051466    0.202 0.840108    
## LotArea                      0.0414283  0.0067654    6.124 1.32e-09 ***
## Alley.None                   0.0023885  0.0044027    0.543 0.587591    
## LotShape.IR1                -0.0055124  0.0041212   -1.338 0.181351    
## LandContour.Lvl              0.0050235  0.0047689    1.053 0.292424    
## LotConfig.Corner             0.0216500  0.0086729    2.496 0.012712 *  
## LotConfig.CulDSac            0.0165252  0.0062359    2.650 0.008178 ** 
## LotConfig.Inside             0.0156417  0.0094183    1.661 0.097074 .  
## LandSlope.Gtl               -0.0035353  0.0047925   -0.738 0.460881    
## Neighborhood.Blmngtn         0.0018639  0.0067879    0.275 0.783692    
## Neighborhood.Blueste        -0.0013947  0.0040120   -0.348 0.728185    
## Neighborhood.BrDale         -0.0056862  0.0058493   -0.972 0.331225    
## Neighborhood.BrkSide         0.0030127  0.0097935    0.308 0.758439    
## Neighborhood.ClearCr         0.0056988  0.0071723    0.795 0.427059    
## Neighborhood.CollgCr        -0.0068298  0.0151880   -0.450 0.653035    
## Neighborhood.Crawfor         0.0178130  0.0096671    1.843 0.065681 .  
## Neighborhood.Edwards        -0.0203691  0.0121556   -1.676 0.094114 .  
## Neighborhood.Gilbert        -0.0066493  0.0100084   -0.664 0.506608    
## Neighborhood.IDOTRR         -0.0220977  0.0080829   -2.734 0.006371 ** 
## Neighborhood.MeadowV        -0.0131950  0.0068287   -1.932 0.053612 .  
## Neighborhood.Mitchel        -0.0118261  0.0084846   -1.394 0.163683    
## Neighborhood.NAmes          -0.0086617  0.0172791   -0.501 0.616283    
## Neighborhood.NoRidge         0.0122484  0.0083247    1.471 0.141519    
## Neighborhood.NPkVill         0.0012465  0.0051260    0.243 0.807920    
## Neighborhood.NridgHt         0.0155357  0.0113292    1.371 0.170596    
## Neighborhood.NWAmes         -0.0059288  0.0106598   -0.556 0.578213    
## Neighborhood.OldTown        -0.0133798  0.0140679   -0.951 0.341792    
## Neighborhood.Sawyer         -0.0097005  0.0105643   -0.918 0.358718    
## Neighborhood.SawyerW        -0.0025334  0.0097870   -0.259 0.795805    
## Neighborhood.Somerst         0.0150595  0.0119355    1.262 0.207339    
## Neighborhood.StoneBr         0.0150349  0.0066443    2.263 0.023864 *  
## Neighborhood.SWISU           0.0020092  0.0074797    0.269 0.788279    
## Neighborhood.Timber         -0.0054680  0.0081967   -0.667 0.504870    
## Neighborhood.Veenker                NA         NA       NA       NA    
## Condition1.Feedr             0.0002787  0.0047156    0.059 0.952879    
## Condition1.Norm              0.0195993  0.0048026    4.081 4.85e-05 ***
## BldgType                     0.0082033  0.0052831    1.553 0.120806    
## HouseStyle.1Story           -0.0005548  0.0105217   -0.053 0.957957    
## OverallQual                  0.0692656  0.0075947    9.120  < 2e-16 ***
## OverallCond                  0.0464160  0.0052461    8.848  < 2e-16 ***
## YearBuilt                    0.0474832  0.0123167    3.855 0.000123 ***
## YearRemodAdd                 0.0102665  0.0064397    1.594 0.111199    
## RoofStyle.Gable             -0.0052600  0.0041393   -1.271 0.204121    
## Exterior1st.Plywood         -0.0081813  0.0060467   -1.353 0.176362    
## `\\`Exterior1st.Wd Sdng\\`` -0.0156152  0.0051295   -3.044 0.002395 ** 
## Exterior2nd.HdBoard         -0.0091333  0.0055403   -1.649 0.099566 .  
## Exterior2nd.MetalSd         -0.0072419  0.0052553   -1.378 0.168509    
## Exterior2nd.Plywood         -0.0034097  0.0062517   -0.545 0.585602    
## Exterior2nd.VinylSd         -0.0057625  0.0069226   -0.832 0.405368    
## MasVnrType.BrkFace          -0.0030663  0.0046645   -0.657 0.511099    
## MasVnrType.Stone             0.0010222  0.0048098    0.213 0.831749    
## ExterQual                   -0.0017332  0.0067145   -0.258 0.796366    
## ExterCond                   -0.0010771  0.0041715   -0.258 0.796298    
## Foundation.BrkTil           -0.0127511  0.0109852   -1.161 0.246022    
## Foundation.CBlock           -0.0127907  0.0164526   -0.777 0.437094    
## Foundation.PConc            -0.0026350  0.0174666   -0.151 0.880116    
## BsmtQual                     0.0081686  0.0074235    1.100 0.271437    
## BsmtExposure                 0.0196917  0.0050041    3.935 8.90e-05 ***
## BsmtFinType1                 0.0031426  0.0058381    0.538 0.590504    
## BsmtFinSF1                   0.0061309  0.0175544    0.349 0.726974    
## BsmtFinType2.Unf             0.0034549  0.0066779    0.517 0.605015    
## BsmtUnfSF                   -0.0222625  0.0184907   -1.204 0.228886    
## TotalBsmtSF                  0.0428800  0.0173974    2.465 0.013881 *  
## HeatingQC                    0.0139327  0.0051194    2.722 0.006612 ** 
## CentralAir                   0.0160220  0.0047279    3.389 0.000730 ***
## Electrical                  -0.0068111  0.0043056   -1.582 0.113986    
## X1stFlrSF                   -0.0093755  0.0160481   -0.584 0.559209    
## X2ndFlrSF                   -0.0021743  0.0160424   -0.136 0.892216    
## GrLivArea                    0.1263049  0.0189686    6.659 4.58e-11 ***
## BsmtFullBath                 0.0208249  0.0056551    3.682 0.000243 ***
## BsmtHalfBath                 0.0031598  0.0040411    0.782 0.434458    
## FullBath                     0.0142209  0.0066098    2.151 0.031681 *  
## HalfBath                     0.0157826  0.0058762    2.686 0.007356 ** 
## BedroomAbvGr                -0.0045713  0.0061268   -0.746 0.455775    
## KitchenQual                  0.0157214  0.0061829    2.543 0.011151 *  
## TotRmsAbvGrd                 0.0114958  0.0082708    1.390 0.164863    
## Functional.Typ               0.0210546  0.0041203    5.110 3.87e-07 ***
## Fireplaces                   0.0096707  0.0071418    1.354 0.176019    
## FireplaceQu                  0.0097816  0.0073034    1.339 0.180775    
## GarageType                  -0.0051448  0.0059234   -0.869 0.385302    
## GarageYrBlt                 -0.0090382  0.0096018   -0.941 0.346778    
## GarageFinish                 0.0027146  0.0059826    0.454 0.650105    
## GarageCars                   0.0208901  0.0094218    2.217 0.026835 *  
## GarageArea                   0.0146619  0.0096095    1.526 0.127387    
## GarageCond                   0.0091355  0.0059961    1.524 0.127937    
## PavedDrive                   0.0064071  0.0045756    1.400 0.161743    
## WoodDeckSF                   0.0104103  0.0041593    2.503 0.012478 *  
## OpenPorchSF                  0.0012621  0.0042135    0.300 0.764591    
## ScreenPorch                  0.0123850  0.0038121    3.249 0.001198 ** 
## Fence.MnPrv                  0.0045636  0.0053075    0.860 0.390087    
## Fence.None                   0.0025446  0.0055795    0.456 0.648449    
## MoSold.April                -0.0001361  0.0045003   -0.030 0.975877    
## MoSold.August               -0.0006647  0.0043594   -0.152 0.878849    
## MoSold.July                  0.0042395  0.0048937    0.866 0.386525    
## MoSold.June                  0.0057991  0.0048867    1.187 0.235622    
## MoSold.March                 0.0026420  0.0043858    0.602 0.547044    
## MoSold.May                   0.0115567  0.0047221    2.447 0.014564 *  
## MoSold.November             -0.0011085  0.0041887   -0.265 0.791340    
## MoSold.October              -0.0052911  0.0042313   -1.250 0.211426    
## YrSold                      -0.0085618  0.0038305   -2.235 0.025631 *  
## SaleType.WD                 -0.0063754  0.0061477   -1.037 0.299972    
## SaleCondition.Abnorml       -0.0071029  0.0073715   -0.964 0.335503    
## SaleCondition.Normal         0.0280764  0.0094936    2.957 0.003176 ** 
## SaleCondition.Partial        0.0276435  0.0094632    2.921 0.003567 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.117 on 987 degrees of freedom
## Multiple R-squared:  0.9218, Adjusted R-squared:  0.9133 
## F-statistic: 107.8 on 108 and 987 DF,  p-value: < 2.2e-16
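
The note "1 not defined because of singularities" and the `NA` row for `Neighborhood.Veenker` are what one would expect if every neighborhood level kept its own dummy column: together with the intercept, a full set of dummies is linearly dependent. This is offered as a likely explanation rather than a confirmed diagnosis. The toy example below, with simulated data and hypothetical names, reproduces the pattern.

```r
set.seed(1)
g <- factor(rep(c("a", "b", "c"), each = 4))

# one indicator column per level, with no reference level left out
dummies <- model.matrix(~ g - 1)
y <- rnorm(12)

# intercept + all three indicators: the columns sum to the intercept column
fit <- lm(y ~ dummies)
coef(fit)
sum(is.na(coef(fit)))
```

Leaving out one level per factor, which is what `model.matrix(~ g)` does by default, removes the dependency.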
# set.seed(2017)
# rf_fit <- train(log(SalePrice) ~ . - Id, data = training,
#                 method = "rf",
#                 preProc = c("center", "scale"),
#                 trControl = fitControl)
# rf_fit
# testing$SalePredict2 <- exp(predict(rf_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict2), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict2) - log(testing$SalePrice))^2) /
#        nrow(testing))
# 
# err[nrow(err) + 1, ] <-
#   c("rf",
#     RMSE(log(testing$SalePredict2),
#          log(testing$SalePrice))
#   )
# set.seed(2017)
# xgbLin_fit <- train(log(SalePrice) ~ . - Id, data = training,
#                     method = "xgbLinear",
#                     preProc = c("center", "scale"),
#                     trControl = fitControl)
# xgbLin_fit
# testing$SalePredict3 <- exp(predict(xgbLin_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict3), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict3) - log(testing$SalePrice))^2) /
#        nrow(testing))
# 
# err[nrow(err) + 1, ] <- 
#   c("xgbLin", 
#     RMSE(log(testing$SalePredict3),
#          log(testing$SalePrice))
#   )
# set.seed(2017)
# xgbTree_fit <- train(log(SalePrice) ~ . - Id, data = training,
#                      method = "xgbTree",
#                      preProc = c("center", "scale"),
#                      trControl = fitControl)
# xgbTree_fit
# testing$SalePredict4 <- exp(predict(xgbTree_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict4), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict4) - log(testing$SalePrice))^2) /
#        nrow(testing))
# 
# err[nrow(err) + 1, ] <- 
#   c("xgbTree", 
#     RMSE(log(testing$SalePredict4),
#          log(testing$SalePrice))
#   )
# set.seed(2017)
# ridge_fit <- train(log(SalePrice) ~ . - Id, data = training,
#                  method = "ridge",
#                  preProc = c("center", "scale"),
#                  trControl = fitControl)
# ridge_fit
# testing$SalePredict5 <- exp(predict(ridge_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict5), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict5) - log(testing$SalePrice))^2) /
#        nrow(testing))
# 
# err[nrow(err) + 1, ] <-
#   c("ridge",
#     RMSE(log(testing$SalePredict5),
#          log(testing$SalePrice))
#   )
# set.seed(2017)
# glmnet_fit <- train(log(SalePrice) ~ . - Id, data = training,
#                  method = "glmnet",
#                  preProc = c("center", "scale"),
#                  trControl = fitControl)
# glmnet_fit
# testing$SalePredict6 <- exp(predict(glmnet_fit, testing, na.action = na.pass))
# RMSE(log(testing$SalePredict6), log(testing$SalePrice))
# sqrt(sum((log(testing$SalePredict6) - log(testing$SalePrice))^2) /
#        nrow(testing))
# 
# err[nrow(err) + 1, ] <-
#   c("glmnet",
#     RMSE(log(testing$SalePredict6),
#          log(testing$SalePrice))
#   )

err[order(as.numeric(err$rmse)), ]
##   model              rmse
## 1    lm 0.108485464227176
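
The long decimal printed above is a hint that `rmse` is stored as text: `c("lm", <number>)` builds a character vector, so the whole row is assigned as strings, and strings sort alphabetically, not by value. One way to keep the column numeric is to append one-row data frames, sketched here with a hypothetical helper `add_score` and placeholder values that are not results from this analysis.

```r
err <- data.frame(model = character(0), rmse = numeric(0),
                  stringsAsFactors = FALSE)

# Hypothetical helper: append a score without coercing it to character.
add_score <- function(err, model, rmse) {
  rbind(err, data.frame(model = model, rmse = rmse, stringsAsFactors = FALSE))
}

err <- add_score(err, "model_a", 0.1085)    # placeholder value
err <- add_score(err, "model_b", 9.12e-05)  # placeholder value
is.numeric(err$rmse)
err[order(err$rmse), ]

# as strings, "0.1085" sorts before "9.12e-05" although it is the larger number
order(c("0.1085", "9.12e-05"))
```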
## examine correlations across model predictions
# cor(testing[, (ncol(testing)-6):ncol(testing)])

## re-train models on entire training dataset
# set.seed(2017)
lm_full <- train(log(SalePrice) ~ . - Id, data = train,
                     method = "lm",
                     preProc = c("center", "scale"),
                     trControl = fitControl)
lm_full
## Linear Regression 
## 
## 1459 samples
##  110 predictor
## 
## Pre-processing: centered (109), scaled (109) 
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1313, 1311, 1314, 1313, 1314, 1314, ... 
## Resampling results:
## 
##   RMSE       Rsquared 
##   0.1202567  0.9099871
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
# set.seed(2017)
# rf_full <- train(log(SalePrice) ~ . - Id, data = train,
#                      method = "rf",
#                      preProc = c("center", "scale"),
#                      trControl = fitControl)
# rf_full
# set.seed(2017)
# xgbLin_full <- train(log(SalePrice) ~ . - Id, data = train,
#                      method = "xgbLinear",
#                      preProc = c("center", "scale"),
#                      trControl = fitControl)
# xgbLin_full
# set.seed(2017)
# xgbTree_full <- train(log(SalePrice) ~ . - Id, data = train,
#                      method = "xgbTree",
#                      preProc = c("center", "scale"),
#                      trControl = fitControl)
# xgbTree_full
# set.seed(2017)
# ridge_full <- train(log(SalePrice) ~ . - Id, data = train,
#                      method = "ridge",
#                      preProc = c("center", "scale"),
#                      trControl = fitControl)
# ridge_full
# set.seed(2017)
# glmnet_full <- train(log(SalePrice) ~ . - Id, data = train,
#                      method = "glmnet",
#                      preProc = c("center", "scale"),
#                      trControl = fitControl)
# glmnet_full

stopCluster(cl)

Make predictions & output to CSV

## linear model prediction
test$SalePrice <- exp(predict(lm_full, test, na.action = na.pass))
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient
## fit may be misleading
test <- test[, which(names(test) %in% names(train))]

## xgbTree prediction
# test$SalePrice <- exp(predict(xgbTree_full, test, na.action = na.pass))
# test <- test[, which(names(test) %in% names(train))]

## combine model predictions by finding mean prediction for each property
# test$SalePrice <- exp(
#   rowMeans(data.frame(
#     predict(lm_full, test, na.action = na.pass),
#     predict(rf_full, test, na.action = na.pass),
#     predict(xgbLin_full, test, na.action = na.pass),
#     predict(xgbTree_full, test, na.action = na.pass),
#     predict(ridge_full, test, na.action = na.pass),
#     predict(glmnet_full, test, na.action = na.pass)),
#     na.rm = TRUE)
# )

predictions <- data.frame(Id = test$Id, SalePrice = test$SalePrice)
head(predictions)
##     Id SalePrice
## 1 1461  119241.4
## 2 1462  162731.6
## 3 1463  178631.7
## 4 1464  198460.5
## 5 1465  195636.8
## 6 1466  169800.0
predictions[is.na(predictions$SalePrice), ]
## [1] Id        SalePrice
## <0 rows> (or 0-length row.names)
## save output, change filename as needed
# write.csv(predictions, file = "Submission_052317_lin2.csv", quote = FALSE, row.names = FALSE)

My best public root mean squared error (RMSE) score in Kaggle’s House Prices: Advanced Regression Techniques competition was 0.12179 (user name: janderman, display name: Judd Anderman), which was the result of my most recent modeling attempt following a few rounds of iteration, error checking, and refinement. In this case, I used only my fitted linear model to predict property SalePrice in the unlabeled test dataset. From my perspective, the relative success of this last submission was a result of two things: missing data imputation (in most cases the missing data points were in fact meaningful and so were fairly easy to impute) and recoding of the predictor and target variables as seemed appropriate in each case, whether that involved performing log transformations, binary coding, or casting categorical variables as numeric ones.
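
To make this kind of recoding concrete, here is a small base R sketch on made-up rows. It treats a missing `Alley` as the meaningful level "None", casts a quality rating to a numeric score, and log-transforms the target. The specific score mapping (`Po` = 1 through `Ex` = 5) is an assumption for illustration and may differ from the coding used earlier in this analysis.

```r
# made-up rows, not taken from the competition data
toy <- data.frame(Alley = c("Grvl", NA, "Pave", NA),
                  ExterQual = c("TA", "Gd", "Ex", "Fa"),
                  SalePrice = c(129975, 163000, 214000, 100000),
                  stringsAsFactors = FALSE)

# a missing Alley means the property has no alley access, so recode it as a level
toy$Alley[is.na(toy$Alley)] <- "None"

# cast the ordered quality labels to numeric scores (assumed mapping)
qual_levels <- c(Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)
toy$ExterQual <- unname(qual_levels[toy$ExterQual])

# model the target on the log scale
toy$LogSalePrice <- log(toy$SalePrice)
toy
```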

*Addendum: I was able to achieve a slightly lower RMSE of 0.12033 on the public leaderboard data by averaging the output of several trained models applied to the test dataset, including the linear model I had used previously. This latter approach was significantly more computationally intensive and time-consuming for what appears to be a relatively modest gain in predictive performance. The relevant code is contained in the last couple of code chunks above but commented out; however, it can be found in a separate R markdown file. While I found it productive to partition the training data so that I could evaluate and compare the performance of different models against known sale prices, I did find that retraining my chosen model(s) on the full training dataset produced improved results. Still, my largest gains in RMSE, the competition’s evaluation metric, occurred early on after more careful examination and deliberate processing and transformation of the supplied training and testing data.
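
Averaging the models' log-scale predictions and then exponentiating, as in the commented-out `rowMeans()` chunk above, amounts to taking the geometric mean of the predicted prices for each property. A short check on made-up predictions from two hypothetical models:

```r
# made-up log-scale predictions from two hypothetical models
log_preds <- data.frame(m1 = log(c(120000, 160000)),
                        m2 = log(c(130000, 150000)))

# average on the log scale, then return to dollars
ensemble <- exp(rowMeans(log_preds, na.rm = TRUE))

# geometric mean of the two dollar-scale predictions
geo_mean <- sqrt(exp(log_preds$m1) * exp(log_preds$m2))

all.equal(ensemble, geo_mean)
```

The geometric mean is never larger than the arithmetic mean of the same prices, so this way of combining models leans slightly toward the lower predictions.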